Re: [PATCH v2] sched/fair: Revert boost in cpu_util()

From: Christian Loehle

Date: Fri Jun 05 2026 - 05:17:42 EST


On 6/5/26 03:48, Hongyan Xia wrote:
> [snip]
>
> Hi Dietmar,
>
> On 6/4/2026 9:21 PM, Dietmar Eggemann wrote:
>> On 04.06.26 11:21, Hongyan Xia wrote:
>>> On 6/4/2026 4:48 PM, Vincent Guittot wrote:
>>>> On Thu, 4 Jun 2026 at 10:21, Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> On 6/4/2026 3:42 PM, Vincent Guittot wrote:
>>>>>> On Thu, 28 May 2026 at 04:36, Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> From: Hongyan Xia <hongyan.xia@xxxxxxxxxxxxx>
>>
>> [...]
>>
>>>>>>> Analysis:
>>>>>>>
>>>>>>> We found several problems that result in the power spike:
>>>>>>>
>>>>>>> 1. Arithmetic should not happen between util_avg and runnable_avg:
>>>>>>>
>>>>>>> After util = max(util, runnable) which potentially picks runnable value
>>>>>>> in cpu_util(), we then add or subtract task util values from it. This
>>>>>>> produces a value that is half-runnable-half-util which is ill-defined.
>>>>>>> This alone should be a warning sign. This breaks EAS calculations in
>>>>>>> many cases, leading to sub-optimal task placements.
>>>>>>
>>>>>> This can be easily fixed
>>>>>
>>>>> I thought about adding or subtracting runnable_avg instead, but that is
>>>>> still wrong. Given three tasks each with 100 util, if they wake up at
>>>>> the same time and run on the same rq, their util is 100, 100, 100,
>>>>> rq total util is 300. Their runnable_avg is 100, 200, 300, rq total
>>>>> runnable_avg is 600. If the 1st task leaves the rq, the remaining two
>>>>> tasks' runnable_avg will then become 100, 200, giving a total rq
>>>>> runnable_avg of 300. However, subtracting the runnable_avg of the 1st
>>>>> task gives 600 - 100 = 500, which is very wrong.
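
To make the numbers in this example easy to check, here is a toy model of
it. It is only an assumption for illustration, not how PELT computes
runnable_avg: it takes the example at face value and gives the k-th queued
task (1-based) a runnable_avg of k * util. rq_runnable_sum() is a
hypothetical helper name.

```c
#include <stddef.h>

/*
 * Toy model of the example above (assumption, not PELT): with nr_tasks
 * equal tasks queued on one rq, the k-th task (1-based) is charged
 * k * util of runnable_avg. Returns the rq total.
 */
static unsigned long rq_runnable_sum(size_t nr_tasks, unsigned long util)
{
	unsigned long sum = 0;
	size_t k;

	for (k = 1; k <= nr_tasks; k++)
		sum += k * util;

	return sum;
}
```

In this model rq_runnable_sum(3, 100) is 600 and rq_runnable_sum(2, 100)
is 300, while subtracting the first task's own value from the three-task
total gives 600 - 100 = 500, which is the mismatch described above.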
>>>>
>>>> Subtracting/adding se.avg.runnable_avg is still the right solution
>>>> because this is what will happen if the task migrates
>>>
>>> One difference is that before runnable boost, runnable_avg really
>>> doesn't affect EAS and CPUFreq much. Runnable boost is the first
>>> instance where we directly use raw runnable_avg values in EAS and
>>> frequency selection, and this value is often too high to be reasonable.
>>> I'm mostly arguing that we should use it in proper places (like the one
>>> in util_est_update()) and not here.
>>>
>>> Actually this is the very first fix we tried internally. There is a minor
>>> improvement in EAS spreading out tasks, but the energy regression is pretty
>>> much the same, and YouTube is still at a 20% regression.
>>
>> I thought the issue was that, in your low-power test cases, most of the
>> tasks involved in these contention scenarios are raising CPU frequency,
>> but they are not directly contributing to the workload (including
>> Android Graphics Pipeline (AGP) progress). If that's the case, the
>> additional power consumption is essentially wasted.
>>
>> Could you collect some information about these contention events and
>> identify the tasks involved?
>
> Our profiling shows that there is a very typical scenario in these
> energy regressions, which is IPC.
>
> Quite a lot of apps have per-CPU worker threads taking data from
> producers. When a producer sends the first piece of data (mostly using
> wake_up_sync()), a worker thread on the same CPU gets woken up, but that
> worker is not immediately switched to, which is right because you want
> the producer to finish producing data. But, because that worker thread
> is already runnable, it starts to shoot up its runnable_avg and the rq
> runnable_avg. When the producer is done and we finally context switch to
> the worker, the frequency is already super high and it takes time for
> the frequency to decay.
>
> Two problems with this:
>
> 1. In this common scenario, the producer and per-CPU consumer threads
> are not really 'contending'. They are accomplishing the same job but
> just split into multiple stages. The 'contention' here should be much
> smaller than two independent threads doing two separate jobs.

I would have expected this to be relatively rare, since feec() should
normally place the consumer on another CPU in the same cluster according
to max-spare-capacity. The producer should already have consumed some spare
capacity on its current CPU, so another CPU should look preferable.
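
To spell out the selection I have in mind, here is a rough sketch. It is a
simplified userspace model, not the kernel code: struct cpu_snapshot and
pick_max_spare_cpu() are hypothetical names, and the real feec() does a
lot more (energy estimates per performance domain, among other things).

```c
#include <stddef.h>

/* Hypothetical per-CPU snapshot; not the kernel's data structures. */
struct cpu_snapshot {
	unsigned long capacity;	/* usable capacity of the CPU */
	unsigned long util;	/* current utilization estimate */
};

/*
 * Return the index of the CPU with the largest spare capacity that can
 * still fit task_util, or -1 if no CPU fits the task.
 */
static int pick_max_spare_cpu(const struct cpu_snapshot *cpus, size_t nr,
			      unsigned long task_util)
{
	unsigned long best_spare = 0;
	int best = -1;
	size_t i;

	for (i = 0; i < nr; i++) {
		unsigned long spare;

		/* Skip CPUs that are already at or above capacity. */
		if (cpus[i].util >= cpus[i].capacity)
			continue;

		spare = cpus[i].capacity - cpus[i].util;
		if (spare < task_util)
			continue;

		if (best < 0 || spare > best_spare) {
			best_spare = spare;
			best = (int)i;
		}
	}

	return best;
}
```

With the producer's CPU at util 300 and two siblings at 100 and 200 (all
with capacity 1024), a consumer of util 100 ends up on the sibling at 100
rather than next to the producer.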

Why does this not happen in practice? Is the consumer wakeup not going
through that path for some other reason, or is the spare-capacity delta
too small?

On your 4+3+1 SoC, does this "dominated by runnable_avg" scenario happen
mostly on a specific cluster, or is it fairly evenly distributed?

> [snip]