Re: [PATCH v2] sched/fair: Revert boost in cpu_util()

From: Hongyan Xia

Date: Fri Jun 05 2026 - 05:39:31 EST


On 6/5/2026 5:02 PM, Christian Loehle wrote:
> On 6/5/26 03:48, Hongyan Xia wrote:
>> [snip]
>>
>> Hi Dietmar,
>>
>> On 6/4/2026 9:21 PM, Dietmar Eggemann wrote:
>>> On 04.06.26 11:21, Hongyan Xia wrote:
>>>> On 6/4/2026 4:48 PM, Vincent Guittot wrote:
>>>>> On Thu, 4 Jun 2026 at 10:21, Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> On 6/4/2026 3:42 PM, Vincent Guittot wrote:
>>>>>>> On Thu, 28 May 2026 at 04:36, Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> From: Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx>
>>>
>>> [...]
>>>
>>>>>>>> Analysis:
>>>>>>>>
>>>>>>>> We found several problems that result in the power spike:
>>>>>>>>
>>>>>>>> 1. Arithmetic should not happen between util_avg and runnable_avg:
>>>>>>>>
>>>>>>>> After util = max(util, runnable) which potentially picks runnable value
>>>>>>>> in cpu_util(), we then add or subtract task util values from it. This
>>>>>>>> produces a value that is half-runnable-half-util which is ill-defined.
>>>>>>>> This alone should be a warning sign. This breaks EAS calculations in
>>>>>>>> many cases, leading to sub-optimal task placements.
>>>>>>>
>>>>>>> This can be easily fixed
>>>>>>
>>>>>> I thought about adding or subtracting runnable_avg instead, but that is
>>>>>> still wrong. Given three tasks each with 100 util, if they wake up at
>>>>>> the same time and run on the same rq, their util is 100, 100, 100,
>>>>>> rq total util is 300. Their runnable_avg is 100, 200, 300, rq total
>>>>>> runnable_avg is 600. If the 1st task leaves the rq, the remaining two
>>>>>> tasks' runnable_avg will then become 100, 200, giving a total rq
>>>>>> runnable_avg of 300. However, subtracting the runnable_avg of the 1st
>>>>>> task gives 600 - 100 = 500, which is very wrong.
>>>>>
>>>>> Subtracting/adding se.avg.runnable_avg is still the right solution
>>>>> because this is what will happen if the task migrates
>>>>
>>>> One difference is that before runnable boost, runnable_avg really
>>>> doesn't affect EAS and CPUFreq much. Runnable boost is the first
>>>> instance where we directly use raw runnable_avg values in EAS and
>>>> frequency selection, and this value is often too high to be reasonable.
>>>> I'm mostly arguing that we should use it in proper places (like the one
>>>> in util_est_update()) and not here.
>>>>
>>>> Actually this is the very first fix we tried internally. There is a minor
>>>> improvement in EAS spreading out tasks, but the energy regression is pretty
>>>> much the same, and YouTube is still at a 20% regression.
>>>
>>> I thought the issue was that, in your low-power test cases, most of the
>>> tasks involved in these contention scenarios are raising CPU frequency,
>>> but they are not directly contributing to the workload (including
>>> Android Graphics Pipeline (AGP) progress). If that's the case, the
>>> additional power consumption is essentially wasted.
>>>
>>> Could you collect some information about these contention events and
>>> identify the tasks involved?
>>
>> Our profiling shows that there is a very typical scenario in these
>> energy regressions, which is IPC.
>>
>> Quite a lot of apps have per-CPU worker threads taking data from
>> producers. When a producer sends the first piece of data (mostly using
>> wake_up_sync()), a worker thread on the same CPU gets woken up, but that
>> worker is not immediately switched to, which is right because you want
>> the producer to finish producing data. But, because that worker thread
>> is already runnable, it starts to shoot up its runnable_avg and the rq
>> runnable_avg. When the producer is done and we finally context switch to
>> the worker, the frequency is already super high and it takes time for
>> the frequency to decay.
>>
>> Two problems with this:
>>
>> 1. In this common scenario, the producer and per-CPU consumer threads
>> are not really 'contending'. They are accomplishing the same job but
>> just split into multiple stages. The 'contention' here should be much
>> smaller than two independent threads doing two separate jobs.
>
> I would have expected this to be relatively rare, since feec() should
> normally place the consumer on another CPU in the same cluster according
> to max-spare-capacity. The producer should already have consumed some spare
> capacity on its current CPU, so another CPU should look preferable.
>
> Why does this not happen in practice? Is the consumer wakeup not going
> through that path for some other reason, or is the spare-capacity delta
> too small?

Both the Binder IPC mechanism in Android and many apps have per-CPU
consumer threads. The producer wakes the consumer thread on the *same*
CPU with wake_up_sync() rather than a normal wake-up. This avoids moving
data across CPUs, because the producer is expected to yield to the
consumer soon.
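
To illustrate why this wake-up pattern inflates runnable_avg, here is a
rough sketch in Python. It is a simplified floating-point model, not the
kernel's PELT code: it assumes 1 ms periods and a 32-period half-life,
and ignores the kernel's fixed-point and sub-period accounting. The 8 ms
producer window and the names pelt_step() and simulate() are made up for
illustration. While the producer runs and the woken consumer waits on the
same rq, the rq has one running task but two runnable tasks:

```python
# Simplified PELT-like model. Assumptions (not kernel code): 1 ms periods,
# 32-period half-life, floating point instead of fixed point.
Y = 0.5 ** (1 / 32)   # per-period decay factor, so that Y**32 == 0.5
SCALE = 1024          # full-capacity contribution of one task


def pelt_step(avg, weight):
    """Decay avg by one period, then add this period's contribution."""
    return avg * Y + (1 - Y) * SCALE * weight


def simulate(producer_ms):
    """Producer runs for producer_ms while the woken consumer waits.

    Returns (rq_util, rq_runnable) at the moment we would finally
    context switch to the consumer. Both signals start from zero.
    """
    rq_util = 0.0
    rq_runnable = 0.0
    for _ in range(producer_ms):
        rq_util = pelt_step(rq_util, 1)          # one task is running
        rq_runnable = pelt_step(rq_runnable, 2)  # two tasks are runnable
    return rq_util, rq_runnable


# Hypothetical 8 ms producer window.
util, runnable = simulate(8)
print(f"rq util ~ {util:.0f}, rq runnable ~ {runnable:.0f}")
```

In this model the rq runnable signal grows twice as fast as the rq util
signal for the whole window, so max(util, runnable) picks the runnable
value even though the two threads are just two stages of the same job.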

> On your 4+3+1 SoC, does this “dominated by runnable_avg” scenario happen
> mostly on a specific cluster, or is it fairly evenly distributed?

Fairly even. In YouTube playback, all three clusters run at around
40-50% higher average frequency, leading to a 20% power regression.

>> [snip]