Re: [PATCH] sched/fair: Revert boost in cpu_util()

From: Qais Yousef

Date: Tue May 19 2026 - 09:37:24 EST

On 05/19/26 02:41, hongyan.xia(夏弘彦) wrote:
> On 5/19/2026 9:17 AM, Qais Yousef wrote:
> > On 05/18/26 11:37, hongyan.xia(夏弘彦) wrote:
> >> On 5/18/2026 6:04 PM, Christian Loehle wrote:
> >>>
> >>> On 5/18/26 03:40, hongyan.xia(夏弘彦) wrote:
> >>>> From: Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx>
> >>>>
> >>>> We have seen a massive power consumption regression (20% SoC power
> >>>> increase in many apps) after updating our kernel. After bisection we
> >>>> pinpointed the regression to the cpu_util(boost) feature. After
> >>>> reverting the boost feature the massive energy regression is gone.
> >>>> Detailed trace analysis is below. The regression is found across quite
> >>>> a few apps, but Youtube is one of the worst offenders, as shown in the
> >>>> 1080p60fps video benchmark:
> >>>>
> >>>> Setup        FPS     SoC Power (mW)   diff
> >>>> w/ boost     59.94   913.6
> >>>> w/o boost    59.93   720.4            -21.15%
> >>>>
> >>>> Signed-off-by: Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx>
> >>>>
> >>>> ---
> >>>> Analysis:
> >>>>
> >>>> We found several problems that result in the power spike:
> >>>>
> >>>> 1. Arithmetic should not happen between util_avg and runnable_avg:
> >>>>
> >>>> After util = max(util, runnable), which potentially picks the runnable
> >>>> value in cpu_util(), we then add or subtract task util values from it. This
> >>>> produces a value that is half-runnable-half-util, which is ill-defined.
> >>>> This alone should be a warning sign. This breaks EAS calculations in
> >>>> many cases, leading to sub-optimal task placements.
> >
> > I don't think it does. The util signal itself has issues too :)
>
> One issue I found is that it sometimes piles up tasks on the same CPU,
> because rq.runnable_avg - task.util_avg is still very high and not much
> lower than rq.runnable_avg, making EAS think there is no benefit in
> spreading out tasks when other CPUs are empty.
>
> But this problem is usually temporary and doesn't last long in reality.

I see. I think the major problem with this logic is that runnable is useful
only during this transient time. But it takes a long time to decay, which
I think (a guess, really) is what causes the problems you're observing. The
contention has gone, but the signal can take 50-100ms to return to the
previous behavior, I think.
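
To put a rough number on that guess, here is a back-of-the-envelope sketch.
It assumes the default 32ms PELT half-life and a signal that receives no new
contribution while it decays. In reality the tasks keep running, so treat the
result as a lower bound. The helper names are made up for illustration, and the
900 and 300 values are borrowed from the three-task example quoted further
down; none of this is kernel code.

```python
import math

# Assumption: default PELT half-life of 32ms.
PELT_HALFLIFE_MS = 32.0


def decayed(value, elapsed_ms, halflife_ms=PELT_HALFLIFE_MS):
    """Signal value after elapsed_ms with no new contribution."""
    return value * 0.5 ** (elapsed_ms / halflife_ms)


def ms_to_decay(start, target, halflife_ms=PELT_HALFLIFE_MS):
    """Milliseconds for start to decay down to target (pure decay)."""
    return halflife_ms * math.log2(start / target)


if __name__ == "__main__":
    # Time for a runnable value of 900 to fall back to a util-like 300.
    print(f"900 -> 300 takes about {ms_to_decay(900, 300):.1f} ms")
    # Value left after two half-lives.
    print(f"900 after 64 ms: {decayed(900, 64):.0f}")
```

Under these assumptions the first figure comes out at roughly 51ms, which is
at the low end of the 50-100ms range I guessed above.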

>
> >>>>
> >>>> 2. Using the absolute value of runnable_avg to drive frequency is
> >>>> too high to be reasonable:
> >>>>
> >>>> We use runnable in a _relative_ way to util to know whether there is
> >>>> contention in several places. However, the _absolute_ value should not
> >>>> be used like util. Runnable_avg tends to be significantly higher,
> >>>> making it much easier to saturate frequency.
> >>>>
> >>>> For example, if three tasks each with a util of 100 contend on the same
> >>>> rq, the rq util is 300 but runnable_avg shoots up to 900. 900 drives the
> >>>> CPU at the max frequency, and it's highly questionable whether this
> >>>> boost is the right decision.
> >
> > I think this is the idea. These tasks are waiting behind other tasks.
> >
> >>>>
> >>>> 3. Runnable_avg may not even reflect true contention:
> >>>>
> >>>> When tasks are dependent, the bottleneck is often the data flow between
> >>>> tasks, not the contention seen by runnable_avg. Boosting frequency with
> >>>> runnable in such scenarios wastes power without performance benefits.
> >
> > I believe contention is used to describe several tasks fighting for CPU time
> > but only a single task can run and the others will be waiting. But I think
> > I know what you mean; I think it is the same thing I was highlighting in [1].
> > We don't care if some tasks end up waiting for more.
> >
> >>>>
> >>>> We found that 1 causes a minor power regression, but 2 and 3 regress
> >>>> power significantly. We have seen multiple applications using the
> >>>> producer-consumer model with many worker threads suffer. When there is
> >>>> IPC between producer and consumer, boosting frequency blindly does not
> >>>> help performance at all if the consumer is limited by how much data flows
> >>>> through. Youtube suffers from 1, 2 and 3 at the same time, leading to a
> >>>> total SoC power regression of 20% shown in the results above.
> >>>
> >>> We did discuss removing runnable boost internally as well, but I’d love to see
> >>> more data too.
> >>> The original issue it was trying to solve was avoiding jank frames during load
> >>> spikes, which YouTube does not really exercise. Some gaming workload data would
> >>> therefore be a useful addition here.
> >>
> >> Although I would be glad to provide more data (after more benchmarks and
> >> pending our internal approval), I wonder, what level of performance gain
> >> do we expect from this feature to justify the big energy regression?
> >>
> >>> Runnable boost was considered as an alternative to approaches like reducing the
> >>> PELT half-life and similar changes. Qais’ current ideas also try to tackle this
> >>> problem, of course, so +CC.
> >
> > A lot of the current behavior is actually good for power by accident. And this
> > runnable approach helps performance as a workaround to these issues. We need to
> > defer some decisions to userspace and just give them a better way to decide
> > their trade-offs. One person's regression is another person's gain..
>
> To be honest, yes, we live in a world where many things work by accident
> and there are definitely a lot of 'accidents' in schedutil. Our
> motivation for this patch is mostly our real world test scenarios that
> mimic customer day of use patterns, and it looks like the perf gain is
> small compared with the energy regression across common apps.
>
> >>>
> >>> If you have run many workloads, do you also have data on where this feature actually
> >>> helped, especially in reducing jank frames?
> >>
> >> We ran our Day of Use (DoU, including Facebook, Youtube and other
> >> popular apps) test model and we did see a 6.6% increase in jank frames
> >> after the revert. Dropped frames went up from 106 to 113 in a total of
> >> 70210 frames. However, in our test model there is no way an increase of
> >> 7 frames within 70210 justifies the energy regression between 10% and
> >> 20% in a lot of apps, hence for us the trade-off decision is very clear
> >> here.
> >>
> >> Another question from me is, if this feature has potentially buggy
> >> corners or mathematical unsoundness (mostly the half-util-half-runnable
> >> value inside cpu_util()), should we rely on its performance gain?
> >>
> >>>
> >>> Some discussion from back then:
> >>> https://lore.kernel.org/lkml/20230406155030.1989554-1-dietmar.eggemann@xxxxxxx/
> >>> https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@xxxxxxx/
> >
> > Generally, I remember I had concerns about this approach back then [1]. I
> > kept quiet after it got merged and won't complain if it is removed now.
> >
> > [1] https://lore.kernel.org/lkml/20230504152328.twh3rqgq2o2gvd4u@airbuntu/
>
> I must say I'm now almost completely echoing what you were saying. Sad
> that I didn't see this thread back then. Our test results confirmed the
> concerns in that thread, namely:
>
> 1. Whether it's a global win: The performance gain seems limited, like
> the jank results (not with Jankbench, but actual animations animated by
> common apps) I just shared with Christian.
> 2. Hurts power: Yes, we saw a dramatic 20% SoC power increase in certain
> apps like Youtube playback.
> 3. Being selective: This is also our concern. In our analysis, it looks
> like it often boosts frequency in cases where it doesn't help perf.
>
> Sad that these questions are answered 3 years later, but better late
> than never :)

:)
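
For anyone who wants to sanity-check the trade-off numbers quoted in this
thread, a small sketch follows. It only re-derives the percentages from the
figures already posted (913.6 vs 720.4 mW, 106 vs 113 dropped frames out of
70210); the helper name is made up for illustration.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# SoC power in mW: with boost -> without boost.
power = pct_change(913.6, 720.4)

# Dropped frames in the DoU run: with boost -> after the revert.
jank = pct_change(106, 113)

# The extra dropped frames as a share of all 70210 frames.
jank_share = (113 - 106) / 70210 * 100.0

if __name__ == "__main__":
    print(f"power: {power:.2f}%")
    print(f"dropped frames: {jank:+.2f}%")
    print(f"extra dropped frames / total frames: {jank_share:.3f}%")
```

The relative jank increase looks large only because the baseline is small.
Measured against the total frame count, the extra dropped frames are about
0.01% of all frames.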