Re: [PATCH v2] sched/fair: Revert boost in cpu_util()

From: Hongyan Xia

Date: Thu Jun 04 2026 - 04:29:47 EST


On 6/4/2026 3:42 PM, Vincent Guittot wrote:
> On Thu, 28 May 2026 at 04:36, Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx> wrote:
>>
>> From: Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx>
>>
>> We have seen a massive power consumption regression (20% SoC power
>> increase in many apps) after updating our kernel. After bisection we
>
> It's always good to provide more details: kernel, version, hardware
> and the test condition

This is on a Dimensity 8400 SoC (1 big, 3 mid, 4 little cores). We just
updated to the Android GKI 6.6 kernel, which picked up the cpu_util()
boost patch. We removed all vendor scheduler code, so scheduler-wise
this is close to an upstream 6.6 kernel. We run common Android apps with
the schedutil governor.

I will add the above to the commit message.

>> pinpointed the regression to the cpu_util(boost) feature. After
>> reverting the boost feature the massive energy regression is gone.
>> Detailed trace analysis down below. The regression is found across quite
>> many apps but Youtube is one of the worst offenders. Some energy
>> benchmark numbers are here.
>>
>> Youtube 1080p60fps video benchmark:
>>              FPS      SoC Power     diff
>> w/ boost     59.94    913.6mW
>> w/o boost    59.93    720.4mW       -21.15%
>>
>> Mobile Legends (gaming)
>>              FPS      sdev    Total power    diff
>> w/ boost     120.16   0.47    3294.10mW
>> w/o boost    120.07   0.56    2996.09mW      -9.05%
>>
>> Genshin Impact (gaming, medium quality)
>>              FPS      sdev    Total power    diff
>> w/ boost     60.05    0.34    6215.84mW
>> w/o boost    60.03    0.35    5695.46mW      -8.37%
>>
>> Signed-off-by: Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx>
>>
>> ---
>> Changed in v2:
>> - Sync all comments with code changes.
>> - Update commit message with more benchmark numbers.
>>
>> Analysis:
>>
>> We found several problems that result in the power spike:
>>
>> 1. Arithmetic should not happen between util_avg and runnable_avg:
>>
>> After util = max(util, runnable) which potentially picks runnable value
>> in cpu_util(), we then add or subtract task util values from it. This
>> produces a value that is half-runnable-half-util which is ill-defined.
>> This alone should be a warning sign. This breaks EAS calculations in
>> many cases, leading to sub-optimal task placements.
>
> This can be easily fixed

I thought about adding or subtracting runnable_avg instead, but that is
still wrong. Take three tasks, each with a util of 100, that wake up at
the same time and run on the same rq. Their util is 100, 100, 100, so
the rq total util is 300. Their runnable_avg is 100, 200, 300, so the rq
total runnable_avg is 600. If the 1st task leaves the rq, the
runnable_avg of the remaining two tasks becomes 100, 200, giving a total
rq runnable_avg of 300. However, subtracting the runnable_avg of the 1st
task gives 600 - 100 = 500, which is very wrong.

I failed to find a way to fix this. The root cause is that it is
impossible to know how much contention there is between the task you are
adding or subtracting and other tasks on the same rq.
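
To make the numbers concrete, here is a toy userspace model of the
example above. It assumes that tasks waking up together run back to
back, so each task's runnable time is its own run time plus that of
every task queued ahead of it. This is a simplification for
illustration only, not how PELT computes runnable_avg, and all helper
names are made up.

```c
#include <stddef.h>

/*
 * Toy model, NOT kernel code: task idx waits for every task queued ahead
 * of it and then runs, so its runnable time is util[0] + ... + util[idx].
 */
static unsigned long toy_task_runnable(const unsigned long *util, size_t idx)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i <= idx; i++)
		sum += util[i];
	return sum;
}

/* rq runnable in the toy model: sum of the per-task runnable values. */
static unsigned long toy_rq_runnable(const unsigned long *util, size_t nr)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		sum += toy_task_runnable(util, i);
	return sum;
}

/* rq util: plain sum of the task util values, independent of ordering. */
static unsigned long toy_rq_util(const unsigned long *util, size_t nr)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		sum += util[i];
	return sum;
}

/* Naive estimate: subtract the 1st task's runnable from the rq total. */
static unsigned long toy_naive_runnable_without_first(const unsigned long *util,
						      size_t nr)
{
	return toy_rq_runnable(util, nr) - toy_task_runnable(util, 0);
}
```

With util[] = { 100, 100, 100 }, the naive estimate is 600 - 100 = 500,
while recomputing for the two remaining tasks (util + 1, nr = 2) gives
300. The difference is the waiting time the 1st task imposed on the
others, which cannot be recovered from its own runnable value.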

>>
>> 2. Using the absolute value of runnable_avg to drive frequency is
>> too high to be reasonable:
>>
>> Schedutil uses runnable in a _relative_ way to util to know whether there
>> is contention in several places. However, the _absolute_ value should
>> not be used like util. Runnable_avg tends to be significantly higher,
>> making it much easier to saturate frequency.
>>
>> For example, if three tasks each with a util of 100 contend on the same
>> rq, the rq util is 300 but runnable_avg shoots up to 600, which is often
>> much higher than needed.
>
> In the email thread of the prev version, you said that using
> runnable_avg is good but not like the current implementation. So
> instead of blindly reverting it, please submit a better usage, as this
> was added to fix some performance issues.

There might be some confusion from the previous email. I was saying that
I agree we can use runnable_avg in comparison with util_avg to detect
contention, but using the absolute runnable_avg in EAS and frequency
selection is questionable.

Actually my personal belief after many failed attempts is that using the
absolute value of runnable_avg in place of util will just not work.

>>
>> 3. Runnable_avg may not even reflect true contention:
>>
>> When tasks are dependent, the bottleneck is often the data flow between
>> tasks, not the contention seen by runnable_avg. Boosting frequency with
>> runnable in such scenarios wastes power without performance benefits.
>>
>> We found that 1 causes a minor power regression but 2 and 3 regress
>> power significantly. We have seen multiple applications with the
>> producer-consumer model with many worker threads suffer. When there is
>> IPC between producer and consumer, boosting frequency blindly does not
>> help performance at all if the consumer is limited by how much data
>> flows through. Youtube suffers from 1, 2 and 3 at the same time,
>> leading to a total SoC power regression of 20% shown in the results above.
>
> Tasks contention is a real problem and runnable_avg is one metric that
> reflects this.

I agree, and it is used in places like util_est_update() to detect
contention by comparing it with util_avg. I'm just saying that the raw
absolute value of runnable_avg is not a good metric to use in EAS and
CPUFreq: it regresses power by more than 20% in apps without much gain.
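
To illustrate the distinction, a minimal sketch of the two ways of
consuming runnable. The helper names and the margin are hypothetical
(the margin is only loosely modelled on a ~1% of capacity threshold);
the second helper mirrors the max(util, runnable) that this patch
removes from cpu_util().

```c
#include <stdbool.h>

#define TOY_CAPACITY_SCALE	1024UL
/* Hypothetical threshold: about 1% of full capacity. */
#define TOY_CONTENTION_MARGIN	(TOY_CAPACITY_SCALE / 100)

/*
 * Relative use: runnable only answers "is there contention?" by being
 * compared against util. Its magnitude is never fed anywhere else.
 */
static bool toy_is_contended(unsigned long util, unsigned long runnable)
{
	return runnable > util + TOY_CONTENTION_MARGIN;
}

/*
 * Absolute use: runnable replaces util outright whenever it is larger,
 * so its magnitude directly drives whatever consumes the result.
 */
static unsigned long toy_boosted_util(unsigned long util,
				      unsigned long runnable)
{
	return util > runnable ? util : runnable;
}
```

For the three-task example (util 300, runnable 600), the relative check
simply reports contention, while the absolute form requests 600, twice
the measured utilization.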

>>
>> ---
>> kernel/sched/cpufreq_schedutil.c | 2 +-
>> kernel/sched/fair.c | 34 ++++++++------------------------
>> kernel/sched/sched.h | 1 -
>> 3 files changed, 9 insertions(+), 28 deletions(-)
>>
>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> index ae9fd211cec1..ba867192513b 100644
>> --- a/kernel/sched/cpufreq_schedutil.c
>> +++ b/kernel/sched/cpufreq_schedutil.c
>> @@ -228,7 +228,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
>> unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
>>
>> if (!scx_switched_all())
>> - util += cpu_util_cfs_boost(sg_cpu->cpu);
>> + util += cpu_util_cfs(sg_cpu->cpu);
>> util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
>> util = max(util, boost);
>> sg_cpu->bw_min = min;
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 728965851842..ecf8b4860951 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8192,7 +8192,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> * @cpu: the CPU to get the utilization for
>> * @p: task for which the CPU utilization should be predicted or NULL
>> * @dst_cpu: CPU @p migrates to, -1 if @p moves from @cpu or @p == NULL
>> - * @boost: 1 to enable boosting, otherwise 0
>> *
>> * The unit of the return value must be the same as the one of CPU capacity
>> * so that CPU utilization can be compared with CPU capacity.
>> @@ -8210,12 +8209,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> * be when a long-sleeping task wakes up. The contribution to CPU utilization
>> * of such a task would be significantly decayed at this point of time.
>> *
>> - * Boosted CPU utilization is defined as max(CPU runnable, CPU utilization).
>> - * CPU contention for CFS tasks can be detected by CPU runnable > CPU
>> - * utilization. Boosting is implemented in cpu_util() so that internal
>> - * users (e.g. EAS) can use it next to external users (e.g. schedutil),
>> - * latter via cpu_util_cfs_boost().
>> - *
>> * CPU utilization can be higher than the current CPU capacity
>> * (f_curr/f_max * max CPU capacity) or even the max CPU capacity because
>> * of rounding errors as well as task migrations or wakeups of new tasks.
>> @@ -8226,19 +8219,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> * though since this is useful for predicting the CPU capacity required
>> * after task migrations (scheduler-driven DVFS).
>> *
>> - * Return: (Boosted) (estimated) utilization for the specified CPU.
>> + * Return: (Estimated) utilization for the specified CPU.
>> */
>> static unsigned long
>> -cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
>> +cpu_util(int cpu, struct task_struct *p, int dst_cpu)
>> {
>> struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
>> unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
>> - unsigned long runnable;
>> -
>> - if (boost) {
>> - runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
>> - util = max(util, runnable);
>> - }
>>
>> /*
>> * If @dst_cpu is -1 or @p migrates from @cpu to @dst_cpu remove its
>> @@ -8295,12 +8282,7 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
>>
>> unsigned long cpu_util_cfs(int cpu)
>> {
>> - return cpu_util(cpu, NULL, -1, 0);
>> -}
>> -
>> -unsigned long cpu_util_cfs_boost(int cpu)
>> -{
>> - return cpu_util(cpu, NULL, -1, 1);
>> + return cpu_util(cpu, NULL, -1);
>> }
>>
>> /*
>> @@ -8322,7 +8304,7 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
>> if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
>> p = NULL;
>>
>> - return cpu_util(cpu, p, -1, 0);
>> + return cpu_util(cpu, p, -1);
>> }
>>
>> /*
>> @@ -8489,7 +8471,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
>> int cpu;
>>
>> for_each_cpu(cpu, pd_cpus) {
>> - unsigned long util = cpu_util(cpu, p, -1, 0);
>> + unsigned long util = cpu_util(cpu, p, -1);
>>
>> busy_time += effective_cpu_util(cpu, util, NULL, NULL);
>> }
>> @@ -8513,7 +8495,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
>>
>> for_each_cpu(cpu, pd_cpus) {
>> struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
>> - unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
>> + unsigned long util = cpu_util(cpu, p, dst_cpu);
>> unsigned long eff_util, min, max;
>>
>> /*
>> @@ -8675,7 +8657,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>> if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>> continue;
>>
>> - util = cpu_util(cpu, p, cpu, 0);
>> + util = cpu_util(cpu, p, cpu);
>> cpu_cap = capacity_of(cpu);
>>
>> /*
>> @@ -11848,7 +11830,7 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
>> break;
>>
>> case migrate_util:
>> - util = cpu_util_cfs_boost(i);
>> + util = cpu_util_cfs(i);
>>
>> /*
>> * Don't try to pull utilization from a CPU with one
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 9f63b15d309d..1c934dd126b2 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -3551,7 +3551,6 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>>
>>
>> extern unsigned long cpu_util_cfs(int cpu);
>> -extern unsigned long cpu_util_cfs_boost(int cpu);
>>
>> static inline unsigned long cpu_util_rt(struct rq *rq)
>> {
>> --
>> 2.47.3
>>