Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints

From: Vincent Guittot
Date: Thu Sep 07 2023 - 11:36:16 EST


On Tue, 29 Aug 2023 at 18:37, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
>
> On 08/29/23 16:35, Vincent Guittot wrote:
> > On Sun, 20 Aug 2023 at 23:08, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> > >
> > > DVFS headroom is applied after we calculate the effective_cpu_util()
> > > which is where we honour uclamp constraints. It makes more sense to
> > > apply the headroom there once and let all users naturally get the right
> > > thing without having to sprinkle the call around in various places.
> >
> > You have to take care of not mixing scheduler and cpufreq constraint and policy.
> >
> > uclamp is a scheduler feature to highlight that the utilization
> > requirement of a task can't go above a level.
>
> uclamp is a performance hint, which utilization is how we represent it
> internally. A task with uclamp of 800 is not the same as util being actually
> 800. In our previous discussions around util_fits_cpu() we had similar
> discussion on how the two can't be treated the same.
>
> Same with uclamp_min; if it is set to 1024 but there is a task with util say
> 100, this task shouldn't cause overutilized as its actual utilization actually
> fits, but it just asked to run at a higher performance point. The actual
> utilization has to be in the over utilized range. Unless uclamp_max forces it
> to fit of course.
>
> So uclamp_max sets a cap to the performance that the task is allowed to reach.
> And this translates into frequency selection and cpu selection (in case of HMP)
> by design.

The dvfs head room is a sole property of cpufreq and the scheduler
doesn't care about it. Scheduler should provide some cpu utilization
hints. Then what cpufreq will do based on the utilization, the HW
constraint and the policy is out of scheduler scope.

This headroom is there only because energy model wants to forecast
what will be the frequency that cpufreq will select later on but
scheduler doesn't care

>
> I don't think we're mixing matters here. But the headroom should be applied to
> actual utilization, not uclamp. If the rq is capped to a specific performance
> level, we should honour this.
>
> We do the same with iowait boost where it is capped by uclamp_max. A task
> capped by uclamp_max shouldn't be the trigger of running at a frequency higher
> than this point. Otherwise we'd need to expose all these internal details to
> the user, which I think we all agree isn't something to consider at all.
>
> >
> > dvfs head room is a cpufreq decision to anticipate either hw
> > limitation and responsiveness problem or performance hints.
> >
> > they come from different sources and rational and this patch mixed
> > them which i'm not sure is a correct thing to do
>
> I don't think I'm mixing them up to be honest.
>
> The governor is driven by effective_cpu_util() to tell it what is the
> effective_cpu_util() when making frequency selection. This function takes into
> account all the factors that could impact frequency selection including all type
> of rq pressures (except thermal). I think it is appropriate to take headroom
> into account there and make sure we honour uclamp_max hint to cap the
> performance appropriately based on the effective uclamp_max value of the rq.
>
> For example if actually util was 640, then after the headroom it'd be 800. And
> if uclamp_max is 800, then this task will still get the 1.25 headroom. We are
> not changing this behavior.
>
> But if the task goes beyond that, then it'll stop at 800 because this what the
> request is all about. A headroom beyond this point is not required because the
> task (or rq to be more precise) is capped to this performance point and
> regardless how busy the cpu gets in terms of real utilization or headroom, it
> shouldn't go beyond this point. ie: if a task is a 1024 but uclamp_max of is
> 800 then it'll only get a performance equivalent to OPP@800 would give.
>
> If we don't do that uclamp_max range effectively is from 0-819 (based on
> current headroom setting of 1.25). Which is not what we want or what we're
> telling userspace. Userspace sees the whole system performance levels
> abstracted from 0 - 1024. As it should. The only effect they would observe and
> there's no way around it is that OPPs are discrete points. So in reality our
> 0-1024 is a staircase where a range of util values will map to the same OPP and
> then we'll get a jump. So the user can end up requesting for example 700 and
> 720 and not notice any difference because they both map to the same OPP.
> I don't think we can fix that - but maybe I should add it to the uclamp doc as
> a caveat when setting uclamp.
>
> >
> > >
> > > Before this fix running
> > >
> > > uclampset -M 800 cat /dev/zero > /dev/null
> > >
> > > Will cause the test system to run at max freq of 2.8GHz. After the fix
> > > it runs at 2.2GHz instead which is the correct value that matches the
> > > capacity of 800.
> >
> > So a task with an utilization of 800 will run at higher freq than a
> > task clamped to 800 by uclamp ? Why should they run at different freq
> > for the same utilization ?
>
> Because uclamp sets an upper limit on the performance level the task should be
> able to achieve. Imagine a task is 1024 and capped to 800, it should not run at
> max frequency, right? What's the point of the uclamp_max hint if the headroom

Who knows ? Here we try to predict the coming utilization of the cpu
and this prediction is only partial so cpufreq might want to keep some
headroom even if uclamp max is applied to a cfs rq. Anticipate
unpredictable irq stolen time

IMO, dvfs_headroom should never be visible in scheduler code. This is
the case now; it's only visible by cpufreq which is the "owner" and
energy model which tries to predict cpufreq behavior

> will cause it to run at max anyway? We lost the meaning of the hint. And if
> this headroom changes in the future, people will start observing different
> behavior for existing uclamp_max settings on the same system because of this
> this rightfully hidden and unaccounted for factor.
>
> >
> > >
> > > Note that similar problem exist for uclamp_min. If util was 50, and
> > > uclamp_min is 100. Since we apply_dvfs_headroom() after apply uclamp
> > > constraints, we'll end up with util of 125 instead of 100. IOW, we get
> > > boosted twice, first time by uclamp_min, and second time by dvfs
> > > headroom.
> > >
> > > Signed-off-by: Qais Yousef (Google) <qyousef@xxxxxxxxxxx>
> > > ---
> > > include/linux/energy_model.h | 1 -
> > > kernel/sched/core.c | 11 ++++++++---
> > > kernel/sched/cpufreq_schedutil.c | 5 ++---
> > > 3 files changed, 10 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> > > index 6ebde4e69e98..adec808b371a 100644
> > > --- a/include/linux/energy_model.h
> > > +++ b/include/linux/energy_model.h
> > > @@ -243,7 +243,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
> > > scale_cpu = arch_scale_cpu_capacity(cpu);
> > > ps = &pd->table[pd->nr_perf_states - 1];
> > >
> > > - max_util = apply_dvfs_headroom(max_util);
> > > max_util = min(max_util, allowed_cpu_cap);
> > > freq = map_util_freq(max_util, ps->frequency, scale_cpu);
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index efe3848978a0..441d433c83cd 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -7439,8 +7439,10 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> > > * frequency will be gracefully reduced with the utilization decay.
> > > */
> > > util = util_cfs + cpu_util_rt(rq);
> > > - if (type == FREQUENCY_UTIL)
> > > + if (type == FREQUENCY_UTIL) {
> > > + util = apply_dvfs_headroom(util);
> >
> > This is not the same as before because utilization has not being
> > scaled by irq steal time yet
>
> We do the scaling below, no?
>
> AFAICS, we had:
>
> (util_cfs + util_rt + irq + ((max-irq)*(util_cfs + util_rt)/max)+ dl_bw) * scale

I think it's only :
((util_cfs + util_rt) * (max -irq) / max + irq + dl_bw) * scale

and it' s effectively similar

>
> Using U = (util_cfs + util_rt) * scale
>
> we can write this after the multiplication
>
> U + irq * scale + ((max-irq)*U/max) + dl_bw * scale
>
> >
> > > util = uclamp_rq_util_with(rq, util, p);
> > > + }
> > >
> > > dl_util = cpu_util_dl(rq);
> > >
> > > @@ -7471,9 +7473,12 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> > > * max - irq
> > > * U' = irq + --------- * U
> > > * max
> > > + *
> > > + * We only need to apply dvfs headroom to irq part since the util part
> > > + * already had it applied.
> > > */
> > > util = scale_irq_capacity(util, irq, max);
> > > - util += irq;
> > > + util += type == FREQUENCY_UTIL ? apply_dvfs_headroom(irq) : irq;
> > >
> > > /*
> > > * Bandwidth required by DEADLINE must always be granted while, for
> > > @@ -7486,7 +7491,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> > > * an interface. So, we only do the latter for now.
> > > */
> > > if (type == FREQUENCY_UTIL)
> > > - util += cpu_bw_dl(rq);
> > > + util += apply_dvfs_headroom(cpu_bw_dl(rq));
> >
> > If we follow your reasoning with uclamp on the dl bandwidth, should we
> > not skip this as well ?
>
> I do remove this in patch 4. Can fold that one into this one if you like.
> I opted to keep the current behavior in this patch and remove these later in
> patch 4.
>
> I do think that both RT and DL shouldn't need DVFS headroom in general as they
> both 'disable' it by default.

and the scheduler in general doesn't care about DVFS headroom.

I don't think it's the right way to fix your problem of selecting the
right frequency on your platform. It would be better to get a proper
interface between EM and cpufreq like
cpufreq_give_me_freq_for_this_utilization_ctx.

And cpufreq needs 2 information:
- the utilization of the cpu
- the uclamp/performance/bandwidth hints on this cpu because the
uclamp_max hints are screwed up by irq scaling anyway.

For example:

irq = 256
util = 800 but uclamp=512

If I follow your reasoning about the uclamp being a performance hint
and uclamp=512 means that we should not go above 512 just for cfs+rt,
we should stay at 512 in this case because irq only needs 256 which is
below the 512 bandwidth. The utilization reported to cpufreq will be
above 512 whatever the current (720) formula or your proposal (608).
In the case of uclamp, it should be applied after having been scaled
by irq time.

So we should have reported utilization of 720 with a bandwidth
requirement of 512 and then cpufreq can applies its headroom if needed

>
>
> Thanks!
>
> --
> Qais Yousef