Re: [RFC PATCH 1/6] sched/uclamp: Track uclamped util_avg in sched_avg

From: Vincent Guittot
Date: Tue Dec 05 2023 - 11:22:56 EST


On Tue, 5 Dec 2023 at 15:24, Hongyan Xia <hongyan.xia2@xxxxxxx> wrote:
>
> On 04/12/2023 16:07, Vincent Guittot wrote:
> > On Wed, 4 Oct 2023 at 11:05, Hongyan Xia <Hongyan.Xia2@xxxxxxx> wrote:
> >>
> >> From: Hongyan Xia <hongyan.xia2@xxxxxxx>
> >>
> >> Track a uclamped version of util_avg in sched_avg, which clamps util_avg
> >> within [uclamp[UCLAMP_MIN], uclamp[UCLAMP_MAX]] every time util_avg is
> >> updated. At the CFS rq level, cfs_rq->avg.util_avg_uclamp must always be
> >> the sum of all util_avg_uclamp of entities on this cfs_rq. So, each
> >> time the util_avg_uclamp of an entity gets updated, we also track the
> >> delta and update the cfs_rq.
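
(For readers following the thread: a rough sketch of what the patch
description above proposes, ignoring group entities for brevity; the
helper name and details are illustrative, not the actual patch code.)

static void update_se_util_uclamp(struct cfs_rq *cfs_rq,
				  struct sched_entity *se)
{
	unsigned long old = se->avg.util_avg_uclamp;
	unsigned long new;
	long delta;

	/* clamp the freshly updated util_avg with the task's uclamp */
	new = clamp(se->avg.util_avg,
		    uclamp_eff_value(task_of(se), UCLAMP_MIN),
		    uclamp_eff_value(task_of(se), UCLAMP_MAX));
	se->avg.util_avg_uclamp = new;

	/* track the delta so the cfs_rq stays the sum of its entities */
	delta = (long)new - (long)old;
	cfs_rq->avg.util_avg_uclamp += delta;
}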
> >
> > No, please don't do that. Don't start mixing several different signals
> > into one. Typically, uclamp_min doesn't say you want to use at least
> > this amount of CPU capacity.
>
> But I'd say with uclamp, PELT is already broken and is a mixed signal

PELT has nothing to do with uclamp. You could argue that EAS makes a
wrong use or mix of PELT signals and uclamp hints when selecting a CPU,
but PELT itself is not impacted by uclamp and should stay out of
uclamp policy.
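
To illustrate the separation (a sketch only; uclamp_rq_get() exists in
recent kernels but the details here are illustrative):

static unsigned long cpu_util_with_hints(struct rq *rq)
{
	/* PELT side: raw accounting, never touched by uclamp */
	unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);

	/* policy side: the hints are applied only at the point of use */
	return clamp(util,
		     uclamp_rq_get(rq, UCLAMP_MIN),
		     uclamp_rq_get(rq, UCLAMP_MAX));
}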

> anyway. When uclamp is used, a util_avg value X doesn't mean X; it means
> X under some rq uclamp, and the worst part is that the rq uclamp may or
> may not have anything to do with this task's uclamp.

I think you are mixing up PELT itself and how it is (or was) used by EAS.

Have you looked at the latest tip/sched/core and the changes in
effective_cpu_util(int cpu, unsigned long util_cfs, unsigned long
*min, unsigned long *max)?

We return 3 values:
- the actual utilization, which is not impacted by uclamp
- a targeted min value, which takes uclamp_min into account
- a targeted max value, which takes uclamp_max into account

https://lore.kernel.org/lkml/20231122133904.446032-1-vincent.guittot@xxxxxxxxxx/
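
A consumer like schedutil can then combine the three values along these
lines (a sketch in the spirit of that series, using the kernel's max()
macro and omitting the dvfs headroom; not a verbatim copy):

static unsigned long effective_perf(unsigned long actual,
				    unsigned long min, unsigned long max)
{
	/* no need to target more than the actual utilization... */
	if (actual < max)
		max = actual;

	/* ...but always provide at least the minimum performance hint */
	return max(min, max);
}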

>
> Pretending X is now a true PELT value (and not a mixed value) is why we
> have so many problems today. For example, in the frequency spike problem,
> if a task A has no idle time under uclamp_max, its PELT does not reflect
> reality. The moment another task B comes in and uncaps the rq uclamp_max,

You are mixing two things here. The PELT signal of the task is correct.

> the current scheduler code thinks the 1024 of A means a real 1024, which
> is wrong and is exactly why we see a spike when B joins. It's also why

This is the true actual utilization: the util_avg of A really is
1024. But the user wants to give a hint, via uclamp_max, to lower its
needs.
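
A toy illustration (standalone C, not kernel code) of why an
always-running task converges to 1024 whatever uclamp_max says: PELT
decays by y per ~1ms period, with y^32 = 0.5, and a task that never
sleeps contributes for the whole of every period.

#include <stdio.h>

int main(void)
{
	double util = 0.0;
	const double y = 0.97857206;	/* ~0.5^(1/32) */

	/* no idle time: the task runs for the whole of every period */
	for (int i = 0; i < 1000; i++)
		util = util * y + (1.0 - y) * 1024.0;

	printf("util_avg -> %.1f\n", util);	/* ~1024.0 */
	return 0;
}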

> we need to special-case 0 spare capacity with recent patches, because rq
> util_avg has lost its original PELT meaning under uclamp.
>
> Because X is not the true PELT value, we always have to do something to
> bring it back into reality. What the current max aggregation code does is
> introduce corner cases, like treating 0 spare capacity as a potential
> opportunity to queue more tasks (which creates further problems in my
> tests), maybe introducing uclamp load-balancing special cases in the
> future, or introducing uclamp filtering to delay the effect of wrong
> PELT values.
>
> What this series does is not much different. We keep the somewhat wrong
> value X, but we also remember under what uclamp values we got that X, to
> bring things back into reality. That means we now have [X, uclamp_min
> when X happens, uclamp_max when X happens]. To save space, this series
> collapses that into [X, clamped X]. The original PELT value X is kept,
> but we use the clamped X in several places to improve our decisions.
>
> >
> > With a task that has a util_avg of 10 but a uclamp_min of 1024, what
> > does it mean when you start to sum this value?
>
> As I replied in another comment, assuming a uclamp_min of 1024 is a
> hint to run the task on the big CPUs, I don't think it's right to

Not necessarily to run on a big CPU, but to say that the task needs
more performance than its actual utilization suggests.

> directly use uclamp as a CPU placement indicator. A uclamp value may
> come from userspace via ADPF. An ADPF uclamp_min value of little CPU
> capacity + 1 certainly doesn't mean a game deliberately wants to avoid
> the little core. It simply means it wants at least this much
> performance, and whether this results in placing the game thread on a
> big CPU is purely the job of EAS (or CAS, etc.). We want to use little
> CPU capacity + 1 as uclamp_min because we know the SoC and know the
> little CPU is bad, but uclamp should be generic and should not rely on
> knowing the SoC.
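
To make that concrete with the 3-value scheme above (standalone C,
illustrative numbers only):

#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
	unsigned long util = 10;	/* actual demand: tiny */
	unsigned long min = 1024;	/* uclamp_min hint: run fast */

	/* frequency selection honours the performance hint... */
	printf("perf target = %lu\n", MAX(min, util));	/* 1024 */

	/* ...while placement still sees a 10-unit task for fitting */
	printf("size for fitting = %lu\n", util);	/* 10 */
	return 0;
}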
>
> Basically, under sum aggregation one would not use a uclamp_min value of
> 1024 to place a small task on a big core. A uclamp_min of 1024 under sum
> aggregation keeps its ADPF meaning: a hint to try to run me as fast as
> possible.
>
> >
> >
> >> [...]