Re: [RFC PATCH 1/6] sched/uclamp: Track uclamped util_avg in sched_avg

From: Hongyan Xia
Date: Tue Dec 05 2023 - 09:24:30 EST


On 04/12/2023 16:07, Vincent Guittot wrote:
> On Wed, 4 Oct 2023 at 11:05, Hongyan Xia <Hongyan.Xia2@xxxxxxx> wrote:
> >
> > From: Hongyan Xia <hongyan.xia2@xxxxxxx>
> >
> > Track a uclamped version of util_avg in sched_avg, which clamps util_avg
> > within [uclamp[UCLAMP_MIN], uclamp[UCLAMP_MAX]] every time util_avg is
> > updated. At the CFS rq level, cfs_rq->avg.util_avg_uclamp must always be
> > the sum of all util_avg_uclamp of entities on this cfs_rq. So, each
> > time the util_avg_uclamp of an entity gets updated, we also track the
> > delta and update the cfs_rq.
>
> No please don't do that. Don't start to mix several different signals
> in one. Typically uclamp_min doesn't say you want to use at least this
> amount of cpu capacity.

But I'd say that with uclamp, PELT is already broken and is a mixed signal anyway. When uclamp is used, a util_avg value X doesn't mean X; it means X under some rq uclamp, and the worst part is that the rq uclamp may or may not have anything to do with this task's uclamp.

Pretending X is a true PELT value now (and not a mixed value) is why we have so many problems today. For example, in the frequency spike problem, if a task A has no idle time under uclamp_max, its PELT does not reflect reality. The moment another task B comes in and uncaps the rq uclamp_max, the current scheduler code thinks the 1024 of A means a real 1024, which is wrong and is exactly why we see a spike when B joins. It's also why we need to special-case 0 spare capacity with recent patches, because rq util_avg has lost its original PELT meaning under uclamp.
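To put rough numbers on the spike, here is a standalone toy sketch (all values and the clampv() helper are made up for illustration; this is not the actual kernel code path):

#include <stdio.h>

static unsigned long clampv(unsigned long v, unsigned long lo, unsigned long hi)
{
        return v < lo ? lo : (v > hi ? hi : v);
}

int main(void)
{
        unsigned long util_a = 1024;       /* A saw no idle time under its cap */
        unsigned long rq_uclamp_max = 300; /* only A runnable: rq cap is A's cap */

        /* schedutil-style request while only A runs: capped at 300 */
        printf("request = %lu\n", clampv(util_a, 0, rq_uclamp_max));

        /* B (uclamp_max = 1024) joins: rq cap is the max over runnable tasks */
        unsigned long util_b = 100;
        rq_uclamp_max = 1024;

        /* A's 1024 is now taken at face value -> request jumps to 1024 */
        printf("request = %lu\n", clampv(util_a + util_b, 0, rq_uclamp_max));
        return 0;
}

A's real demand under its cap was at most 300, yet the request jumps to 1024 the moment B uncaps the rq.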

Because X is not the true PELT value, we always have to do something to bring it back into reality. What the current max aggregation code does is introduce corner cases, like treating 0 spare capacity as a potential opportunity to queue more tasks (which creates further problems in my tests), and maybe introducing uclamp load balancing special cases in the future, or introducing uclamp filtering to delay the effect of wrong PELT values.

What this series does is not much different. We keep the somewhat wrong value X, but we also remember under what uclamp values we got that X, to bring things back into reality. That means we now have [X, uclamp_min when X happens, uclamp_max when X happens]. To save space, this becomes [X, clamped X], which is what this series does. The original PELT value X is kept, but we use the clamped X in several places to improve our decisions.
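A minimal sketch of that bookkeeping, based on the cover description above (struct and function names here are simplified stand-ins; the real patch hooks into the PELT update paths):

struct sched_avg_sketch {
        unsigned long util_avg;        /* unclamped PELT, kept as before */
        unsigned long util_avg_uclamp; /* clamp(util_avg, uc_min, uc_max) */
};

/*
 * Called whenever an entity's util_avg changes: re-clamp the entity value
 * and propagate the delta, so cfs_rq->avg.util_avg_uclamp stays the sum
 * of the util_avg_uclamp of all entities on the cfs_rq.
 */
static void update_util_avg_uclamp(struct sched_avg_sketch *cfs_rq_avg,
                                   struct sched_avg_sketch *se_avg,
                                   unsigned long uc_min, unsigned long uc_max)
{
        unsigned long clamped = se_avg->util_avg;
        long delta;

        if (clamped < uc_min)
                clamped = uc_min;
        else if (clamped > uc_max)
                clamped = uc_max;

        delta = (long)clamped - (long)se_avg->util_avg_uclamp;
        se_avg->util_avg_uclamp = clamped;
        cfs_rq_avg->util_avg_uclamp += delta;
}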


> With tasks with a util_avg of 10 but a uclamp min of 1024 what does it
> mean when you start to sum this value ?

Like I replied in another comment, assuming a uclamp_min of 1024 is a hint to run the task on the big CPUs, I don't think it's right to directly use uclamp as a CPU placement indicator. A uclamp value may come from ADPF in userspace. An ADPF uclamp_min value of little CPU capacity + 1 certainly doesn't mean a game wants to avoid the little core on purpose. It simply means it wants at least this much performance, and whether this results in placing the game thread on a big CPU is purely the job of EAS (or CAS, etc.). We want to use little CPU + 1 as uclamp_min because we know the SoC and the little CPU is weak, but uclamp should be generic and should not rely on knowing the SoC.

Basically, under sum aggregation one would not use a uclamp_min value of 1024 to place a small task on a big core. A uclamp_min of 1024 under sum aggregation carries the same meaning as in ADPF: a hint to try to run me as fast as possible.
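As a toy calculation of what such a task contributes under sum aggregation (numbers invented for illustration, reusing a hand-rolled clampv() helper):

#include <stdio.h>

static unsigned long clampv(unsigned long v, unsigned long lo, unsigned long hi)
{
        return v < lo ? lo : (v > hi ? hi : v);
}

int main(void)
{
        /* a small task with an aggressive ADPF-style uclamp_min */
        unsigned long util = 10, uc_min = 1024, uc_max = 1024;

        /* contributes 1024 to the rq's clamped sum -> drives frequency up */
        printf("contrib = %lu\n", clampv(util, uc_min, uc_max));
        return 0;
}

The clamped contribution raises the frequency request; whether the task then ends up on a big CPU still falls out of EAS/CAS comparing energy and capacity, not out of the hint acting as a direct placement flag.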



[...]