Re: [PATCH v2] sched/fair: Revert boost in cpu_util()

From: Vincent Guittot

Date: Thu Jun 04 2026 - 04:52:16 EST


On Thu, 4 Jun 2026 at 10:21, Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx> wrote:
>
> On 6/4/2026 3:42 PM, Vincent Guittot wrote:
> > On Thu, 28 May 2026 at 04:36, Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx> wrote:
> >>
> >> From: Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx>
> >>
> >> We have seen a massive power consumption regression (20% SoC power
> >> increase in many apps) after updating our kernel. After bisection we
> >
> > It's always good to provide more details: kernel, version, hardware
> > and the test condition
>
> This is on Dimensity 8400 SoC (1 big 3 mid 4 little). We just updated to
> the Android GKI kernel 6.6 which picked up this patch. We removed all
> scheduler vendor stuff so scheduler-wise this is close to an upstream
> 6.6 kernel. We run common Android apps on schedutil.
>
> I will add the above in the commit message.
>
> >> pinpointed the regression to the cpu_util(boost) feature. After
> >> reverting the boost feature the massive energy regression is gone.
> >> Detailed trace analysis down below. The regression is found across quite
> >> many apps but Youtube is one of the worst offenders. Some energy
> >> benchmark numbers are here.
> >>
> >> Youtube 1080p60fps video benchmark:
> >>              FPS      SoC Power    diff
> >> w/ boost     59.94    913.6mW
> >> w/o boost    59.93    720.4mW      -21.15%
> >>
> >> Mobile Legends (gaming)
> >>              FPS      sdev    Total power    diff
> >> w/ boost     120.16   0.47    3294.10mW
> >> w/o boost    120.07   0.56    2996.09mW      -9.05%
> >>
> >> Genshin Impact (gaming, medium quality)
> >>              FPS      sdev    Total power    diff
> >> w/ boost     60.05    0.34    6215.84mW
> >> w/o boost    60.03    0.35    5695.46mW      -8.37%
> >>
> >> Signed-off-by: Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx>
> >>
> >> ---
> >> Changed in v2:
> >> - Sync all comments with code changes.
> >> - Update commit message with more benchmark numbers.
> >>
> >> Analysis:
> >>
> >> We found several problems that result in the power spike:
> >>
> >> 1. Arithmetic should not happen between util_avg and runnable_avg:
> >>
> >> After util = max(util, runnable) which potentially picks runnable value
> >> in cpu_util(), we then add or subtract task util values from it. This
> >> produces a value that is half-runnable-half-util which is ill-defined.
> >> This alone should be a warning sign. This breaks EAS calculations in
> >> many cases, leading to sub-optimal task placements.
> >
> > This can be easily fixed
>
> I thought about adding or subtracting runnable_avg instead, but that is
> still wrong. Given three tasks each with 100 util, if they wake up at
> the same time and run on the same rq, their util is 100, 100, 100,
> rq total util is 300. Their runnable_avg is 100, 200, 300, rq total
> runnable_avg is 600. If the 1st task leaves the rq, the remaining two
> task runnable_avg will then become 100, 200, giving a total rq
> runnable_avg of 300. However, subtracting the runnable_avg of the 1st
> task gives 600 - 100 = 500, which is very wrong.

Subtracting/adding se.avg.runnable_avg is still the right solution
because this is what will happen if the task migrates.
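
For reference, the arithmetic in the three-task example quoted above can
be reproduced with a small userspace sketch. It is a toy model, not
kernel code and not PELT: it assumes tasks run back to back in array
order, so that a task's runnable value is its own util plus the util of
every task queued ahead of it. The names toy_task_runnable() and
toy_rq_runnable() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (assumption, not PELT): tasks run back to back in array
 * order, so task i is runnable for the sum of util[0..i].
 */
static unsigned long toy_task_runnable(const unsigned long *util, size_t i)
{
	unsigned long sum = 0;

	assert(util != NULL);
	for (size_t k = 0; k <= i; k++)
		sum += util[k];
	return sum;
}

/* rq-level runnable in the toy model: sum of the per-task values. */
static unsigned long toy_rq_runnable(const unsigned long *util, size_t n)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += toy_task_runnable(util, i);
	return sum;
}
```

With util = {100, 100, 100}, toy_rq_runnable() gives 600 and the first
task's value is 100. Subtracting gives 500, while recomputing for the
two remaining tasks gives 300, which is the gap described in the quoted
example.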

>
> I failed to find a way to fix this. The root cause is that it is
> impossible to know how much contention there is between the task you are
> adding or subtracting and other tasks on the same rq.
>
> >>
> >> 2. Using the absolute value of runnable_avg to drive frequency is
> >> too high to be reasonable:
> >>
> >> Schedutil uses runnable in a _relative_ way to util to know whether there
> >> is contention in several places. However, the _absolute_ value should
> >> not be used like util. Runnable_avg tends to be significantly higher,
> >> making it much easier to saturate frequency.
> >>
> >> For example, if three tasks each with a util of 100 contend on the same
> >> rq, the rq util is 300 but runnable_avg shoots up to 600, which is often
> >> much higher than needed.
> >
> > In the email thread of the prev version, you said that using
> > runnable_avg is good but not like the current implementation. So
> > instead of blindly reverting it, please submit a better usage, as this
> > was added to fix some performance issues.
>
> There might be some confusion in the prev email. I was saying that I
> agree we can use runnable_avg in comparison with util_avg to detect
> contention, but using the absolute runnable_avg in EAS and frequency
> selection is questionable.
>
> Actually my personal belief after many failed attempts is that using the
> absolute value of runnable_avg in place of util will just not work.
>
> >>
> >> 3. Runnable_avg may not even reflect true contention:
> >>
> >> When tasks are dependent, the bottleneck is often the data flow between
> >> tasks, not the contention seen by runnable_avg. Boosting frequency with
> >> runnable in such scenarios wastes power without performance benefits.
> >>
> >> We found 1 causes a minor power regression but 2 and 3 regress power
> >> significantly. We have seen multiple applications with the
> >> producer-consumer model with many worker threads suffer. When there is
> >> IPC between producer and consumer, boosting frequency blindly does not
> >> help performance at all if the consumer is limited by how much data flows
> >> through. Youtube suffers from 1, 2 and 3 at the same time, leading to a
> >> total SoC power regression of 20% shown in the results above.
> >
> > Task contention is a real problem and runnable_avg is one metric that
> > reflects this.
>
> I agree, and it is used in places like util_est_update() to detect
> contention by comparing it with util_avg. I'm just saying the raw
> absolute value of runnable_avg is not a good metric to use in EAS and
> CPUFreq, regressing power by more than 20% in apps without much gain.
>
> >>
> >> ---
> >> kernel/sched/cpufreq_schedutil.c | 2 +-
> >> kernel/sched/fair.c | 34 ++++++++------------------------
> >> kernel/sched/sched.h | 1 -
> >> 3 files changed, 9 insertions(+), 28 deletions(-)
> >>
> >> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> >> index ae9fd211cec1..ba867192513b 100644
> >> --- a/kernel/sched/cpufreq_schedutil.c
> >> +++ b/kernel/sched/cpufreq_schedutil.c
> >> @@ -228,7 +228,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> >> unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
> >>
> >> if (!scx_switched_all())
> >> - util += cpu_util_cfs_boost(sg_cpu->cpu);
> >> + util += cpu_util_cfs(sg_cpu->cpu);
> >> util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
> >> util = max(util, boost);
> >> sg_cpu->bw_min = min;
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 728965851842..ecf8b4860951 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -8192,7 +8192,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >> * @cpu: the CPU to get the utilization for
> >> * @p: task for which the CPU utilization should be predicted or NULL
> >> * @dst_cpu: CPU @p migrates to, -1 if @p moves from @cpu or @p == NULL
> >> - * @boost: 1 to enable boosting, otherwise 0
> >> *
> >> * The unit of the return value must be the same as the one of CPU capacity
> >> * so that CPU utilization can be compared with CPU capacity.
> >> @@ -8210,12 +8209,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >> * be when a long-sleeping task wakes up. The contribution to CPU utilization
> >> * of such a task would be significantly decayed at this point of time.
> >> *
> >> - * Boosted CPU utilization is defined as max(CPU runnable, CPU utilization).
> >> - * CPU contention for CFS tasks can be detected by CPU runnable > CPU
> >> - * utilization. Boosting is implemented in cpu_util() so that internal
> >> - * users (e.g. EAS) can use it next to external users (e.g. schedutil),
> >> - * latter via cpu_util_cfs_boost().
> >> - *
> >> * CPU utilization can be higher than the current CPU capacity
> >> * (f_curr/f_max * max CPU capacity) or even the max CPU capacity because
> >> * of rounding errors as well as task migrations or wakeups of new tasks.
> >> @@ -8226,19 +8219,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >> * though since this is useful for predicting the CPU capacity required
> >> * after task migrations (scheduler-driven DVFS).
> >> *
> >> - * Return: (Boosted) (estimated) utilization for the specified CPU.
> >> + * Return: (Estimated) utilization for the specified CPU.
> >> */
> >> static unsigned long
> >> -cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> >> +cpu_util(int cpu, struct task_struct *p, int dst_cpu)
> >> {
> >> struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> >> unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
> >> - unsigned long runnable;
> >> -
> >> - if (boost) {
> >> - runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
> >> - util = max(util, runnable);
> >> - }
> >>
> >> /*
> >> * If @dst_cpu is -1 or @p migrates from @cpu to @dst_cpu remove its
> >> @@ -8295,12 +8282,7 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> >>
> >> unsigned long cpu_util_cfs(int cpu)
> >> {
> >> - return cpu_util(cpu, NULL, -1, 0);
> >> -}
> >> -
> >> -unsigned long cpu_util_cfs_boost(int cpu)
> >> -{
> >> - return cpu_util(cpu, NULL, -1, 1);
> >> + return cpu_util(cpu, NULL, -1);
> >> }
> >>
> >> /*
> >> @@ -8322,7 +8304,7 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
> >> if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
> >> p = NULL;
> >>
> >> - return cpu_util(cpu, p, -1, 0);
> >> + return cpu_util(cpu, p, -1);
> >> }
> >>
> >> /*
> >> @@ -8489,7 +8471,7 @@ static inline void eenv_pd_busy_time(struct energy_env *eenv,
> >> int cpu;
> >>
> >> for_each_cpu(cpu, pd_cpus) {
> >> - unsigned long util = cpu_util(cpu, p, -1, 0);
> >> + unsigned long util = cpu_util(cpu, p, -1);
> >>
> >> busy_time += effective_cpu_util(cpu, util, NULL, NULL);
> >> }
> >> @@ -8513,7 +8495,7 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
> >>
> >> for_each_cpu(cpu, pd_cpus) {
> >> struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
> >> - unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
> >> + unsigned long util = cpu_util(cpu, p, dst_cpu);
> >> unsigned long eff_util, min, max;
> >>
> >> /*
> >> @@ -8675,7 +8657,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >> if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> >> continue;
> >>
> >> - util = cpu_util(cpu, p, cpu, 0);
> >> + util = cpu_util(cpu, p, cpu);
> >> cpu_cap = capacity_of(cpu);
> >>
> >> /*
> >> @@ -11848,7 +11830,7 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
> >> break;
> >>
> >> case migrate_util:
> >> - util = cpu_util_cfs_boost(i);
> >> + util = cpu_util_cfs(i);
> >>
> >> /*
> >> * Don't try to pull utilization from a CPU with one
> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> index 9f63b15d309d..1c934dd126b2 100644
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -3551,7 +3551,6 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
> >>
> >>
> >> extern unsigned long cpu_util_cfs(int cpu);
> >> -extern unsigned long cpu_util_cfs_boost(int cpu);
> >>
> >> static inline unsigned long cpu_util_rt(struct rq *rq)
> >> {
> >> --
> >> 2.47.3
> >>
>