[PATCH] sched/fair: Revert boost in cpu_util()

From: hongyan.xia(夏弘彦)

Date: Sun May 17 2026 - 22:41:08 EST


From: Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx>

We have seen a large power consumption regression (about 20% SoC power
increase in many apps) after updating our kernel. Bisection pinpointed
the regression to the cpu_util(boost) feature, and reverting that
feature removes the regression. A detailed trace analysis follows
below. The regression shows up across many apps, but YouTube is one of
the worst offenders, as shown in the 1080p60fps video benchmark:

Setup      FPS    SoC Power (mW)  diff
w/ boost   59.94  913.6
w/o boost  59.93  720.4           -21.15%

Signed-off-by: Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx>

---
Analysis:

We found several problems that result in the power spike:

1. Arithmetic should not happen between util_avg and runnable_avg:

After util = max(util, runnable) in cpu_util(), which may pick the
runnable value, we then add or subtract task util values. The result is
half runnable, half util, which is ill-defined. This alone should be a
warning sign. It breaks EAS calculations in many cases, leading to
sub-optimal task placement.
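
The mixing described in point 1 can be sketched as follows. This is a
minimal standalone illustration, not kernel code: struct pelt_sums and
the three helper functions are hypothetical names, and the numbers in
the note after the block are made up for illustration.

```c
#include <assert.h>

/* Hypothetical container for the two PELT signals discussed above. */
struct pelt_sums {
	unsigned long util_avg;
	unsigned long runnable_avg;
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Mirrors the order of operations of the reverted boost path:
 * take max(util, runnable) for the rq, then subtract the task's
 * util_avg (not its runnable_avg), saturating at zero.
 */
unsigned long boosted_util_without(struct pelt_sums rq, struct pelt_sums task)
{
	unsigned long v = max_ul(rq.util_avg, rq.runnable_avg);

	return v > task.util_avg ? v - task.util_avg : 0;
}

/* Pure util arithmetic: rq util minus task util. */
unsigned long util_without(struct pelt_sums rq, struct pelt_sums task)
{
	assert(rq.util_avg >= task.util_avg);
	return rq.util_avg - task.util_avg;
}

/* Pure runnable arithmetic: rq runnable minus task runnable. */
unsigned long runnable_without(struct pelt_sums rq, struct pelt_sums task)
{
	assert(rq.runnable_avg >= task.runnable_avg);
	return rq.runnable_avg - task.runnable_avg;
}
```

With an rq of {util_avg = 300, runnable_avg = 900} and a task of
{util_avg = 100, runnable_avg = 300}, the mixed path yields 800, while
consistent util arithmetic yields 200 and consistent runnable
arithmetic yields 600. The mixed value matches neither signal.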

2. The absolute value of runnable_avg is too high to reasonably drive
frequency:

We use runnable _relative_ to util in several places to detect
contention. However, its _absolute_ value should not be used like util.
runnable_avg tends to be significantly higher, making it much easier to
saturate frequency.

For example, if three tasks each with a util of 100 contend on the same
rq, the rq util is 300 but runnable_avg shoots up to 900. 900 drives the
CPU at the max frequency, and it's highly questionable whether this
boost is the right decision.
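
The arithmetic in this example can be reproduced with the sketch
below. It rests on two assumptions that are not part of the patch: a
simplified contention model in which every task stays runnable for the
whole time any of the tasks runs (so each task's runnable equals the
summed util), and a schedutil-style frequency mapping with 25%
headroom clamped to max_freq. All function names are hypothetical.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* Sum of per-task util values on one runqueue. */
unsigned long rq_util(const unsigned long *task_util, int nr)
{
	unsigned long sum = 0;

	assert(nr >= 0);
	for (int i = 0; i < nr; i++)
		sum += task_util[i];
	return sum;
}

/*
 * Simplified model (assumption): under full contention each task is
 * runnable while any task runs, so per-task runnable == rq util and
 * rq runnable == nr * rq util.
 */
unsigned long rq_runnable(const unsigned long *task_util, int nr)
{
	return (unsigned long)nr * rq_util(task_util, nr);
}

/*
 * Assumed schedutil-style mapping: freq = 1.25 * max_freq * util / 1024,
 * clamped to max_freq.
 */
unsigned long util_to_freq(unsigned long util, unsigned long max_freq)
{
	unsigned long freq;

	freq = (max_freq + (max_freq >> 2)) * util / SCHED_CAPACITY_SCALE;
	return freq > max_freq ? max_freq : freq;
}
```

For three tasks with util 100 each, rq_util() gives 300 and
rq_runnable() gives 900. With a hypothetical max_freq of 2000000 kHz,
a util of 300 requests roughly 732 MHz under this mapping, while 900
is clamped to the maximum frequency.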

3. Runnable_avg may not even reflect true contention:

When tasks are dependent, the bottleneck is often the data flow between
tasks, not the contention seen by runnable_avg. Boosting frequency with
runnable in such scenarios wastes power without performance benefits.

We found that 1 causes a minor power regression, while 2 and 3 regress
power significantly. We have seen multiple applications that use the
producer-consumer model with many worker threads suffer. When there is
IPC between producer and consumer, boosting frequency blindly does not
help performance at all if the consumer is limited by how much data
flows through. YouTube suffers from 1, 2 and 3 at the same time,
leading to the total SoC power regression of 20% shown in the results
above.
---
kernel/sched/cpufreq_schedutil.c | 2 +-
kernel/sched/fair.c | 32 +++++++-------------------------
kernel/sched/sched.h | 1 -
3 files changed, 8 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index ae9fd211cec1..ba867192513b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -228,7 +228,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);

if (!scx_switched_all())
- util += cpu_util_cfs_boost(sg_cpu->cpu);
+ util += cpu_util_cfs(sg_cpu->cpu);
util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
util = max(util, boost);
sg_cpu->bw_min = min;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..86c6814121b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8192,7 +8192,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* @cpu: the CPU to get the utilization for
* @p: task for which the CPU utilization should be predicted or NULL
* @dst_cpu: CPU @p migrates to, -1 if @p moves from @cpu or @p == NULL
- * @boost: 1 to enable boosting, otherwise 0
*
* The unit of the return value must be the same as the one of CPU capacity
* so that CPU utilization can be compared with CPU capacity.
@@ -8210,12 +8209,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* be when a long-sleeping task wakes up. The contribution to CPU utilization
* of such a task would be significantly decayed at this point of time.
*
- * Boosted CPU utilization is defined as max(CPU runnable, CPU utilization).
- * CPU contention for CFS tasks can be detected by CPU runnable > CPU
- * utilization. Boosting is implemented in cpu_util() so that internal
- * users (e.g. EAS) can use it next to external users (e.g. schedutil),
- * latter via cpu_util_cfs_boost().
- *
* CPU utilization can be higher than the current CPU capacity
* (f_curr/f_max * max CPU capacity) or even the max CPU capacity because
* of rounding errors as well as task migrations or wakeups of new tasks.
@@ -8229,16 +8222,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* Return: (Boosted) (estimated) utilization for the specified CPU.
*/
static unsigned long
-cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
+cpu_util(int cpu, struct task_struct *p, int dst_cpu)
{
struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
- unsigned long runnable;
-
- if (boost) {
- runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
- util = max(util, runnable);
- }

/*
* If @dst_cpu is -1 or @p migrates from @cpu to @dst_cpu remove its
@@ -8295,12 +8282,7 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)

unsigned long cpu_util_cfs(int cpu)
{
- return cpu_util(cpu, NULL, -1, 0);
-}
-
-unsigned long cpu_util_cfs_boost(int cpu)
-{
- return cpu_util(cpu, NULL, -1, 1);
+ return cpu_util(cpu, NULL, -1);
}

/*
@@ -8322,7 +8304,7 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
p = NULL;

- return cpu_util(cpu, p, -1, 0);
+ return cpu_util(cpu, p, -1);
}

/*
@@ -8489,7 +8471,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
int cpu;

for_each_cpu(cpu, pd_cpus) {
- unsigned long util = cpu_util(cpu, p, -1, 0);
+ unsigned long util = cpu_util(cpu, p, -1);

busy_time += effective_cpu_util(cpu, util, NULL, NULL);
}
@@ -8513,7 +8495,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,

for_each_cpu(cpu, pd_cpus) {
struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
- unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
+ unsigned long util = cpu_util(cpu, p, dst_cpu);
unsigned long eff_util, min, max;

/*
@@ -8675,7 +8657,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;

- util = cpu_util(cpu, p, cpu, 0);
+ util = cpu_util(cpu, p, cpu);
cpu_cap = capacity_of(cpu);

/*
@@ -11848,7 +11830,7 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
break;

case migrate_util:
- util = cpu_util_cfs_boost(i);
+ util = cpu_util_cfs(i);

/*
* Don't try to pull utilization from a CPU with one
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d..1c934dd126b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3551,7 +3551,6 @@ static inline unsigned long cpu_util_dl(struct rq *rq)


extern unsigned long cpu_util_cfs(int cpu);
-extern unsigned long cpu_util_cfs_boost(int cpu);

static inline unsigned long cpu_util_rt(struct rq *rq)
{
--
2.47.3