Re: [PATCH 2/4] sched: cpufreq: Fix apply_dvfs_headroom() escaping uclamp constraints

From: Qais Yousef
Date: Sun Sep 10 2023 - 13:46:47 EST


On 09/08/23 16:30, Vincent Guittot wrote:

> > Hmm IIUC, no. irq pressure will be added on top as normal. It's just cfs+rt
> > will be restricted by their uclamp settings _after_ the headroom is applied.
>
> But then you mixed some utilization level (irq pressure) which is
> time related with some performance/bandwidth limitation which is freq
> related. And I think that part of the reason why we can't agree on
> where to put headroom and how to take into account uclamp

I did not change how this part works. We can drop patch 4 completely if this is
what is causing the confusion.

What I changed is the order of applying uclamp_rq_util_with() and
apply_dvfs_headroom(). The rest is still the same as-is in the code today.

>
> >
> > > above 512 whatever the current (720) formula or your proposal (608).
> > > In the case of uclamp, it should be applied after having been scaled
> > > by irq time.
> >
> > I lost you a bit here. I'm not sure how you reached the 720 and 608 numbers.
>
> My bad, I finally decided to use an irq pressure of 128 in my
> calculation but forgot to change the value in my email
>
> >
> > So the way I'm proposing it here
> >
> > util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
> > util = uclamp_rq_util_with(rq, util, NULL) = 512
> > util = scale_rq_capacity(512, 256, 1024) = 0.75 * 512 = 384
> > util += dvfs_headroom(irq) = 384 + 256 + 0.25 * 256 = 704
> > util += dvfs_headroom(dl_bw) = 704
> >
> > >
> > > So we should have reported utilization of 720 with a bandwidth
> > > requirement of 512 and then cpufreq can applies its headroom if needed
> >
> > The key part I'm changing is this
> >
> > util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
> > util = uclamp_rq_util_with(rq, util, NULL) = 512
> >
> > Before we had (assume irq, rt and dl are 0 for simplicity and a single task is
> > running)
> >
> > util = cfs = 800
> > util = uclamp_rq_util_with(rq, util, NULL) = 512
> > util = dvfs_headroom(util) = 512 * 0.25 * 512 = 640
> >
> > So we are running higher than we are allowed to. So applying the headroom
> > after taking uclamp constraints into account is the problem.
>
> But I think it's not correct because you mixed some utilization level
> with some performance requirement. IIRC, you said that the
> uclamp/min/max must not be seen as an utilization level but a
> performance hint. In such case, you can't add it to some other
> utilization because it defeats its original purpose

But this is what we do today already. I didn't change this part. See below.

>
> >
> > IIUC and your concerns are about how scheduler and schedutil are interacting
> > loses some abstraction, then yeah this makes sense and can refactor code to
> > better abstract the two and keep them decoupled.
> >
> > But if you think the order the headroom is applied should be the way it is,
> > then this doesn't fix the problem.
> >
> > A good use case to consider is a single task running at 1024. If I want to
> > limit it to 80% using uclamp_max, how can this be enforced? With current code,
> > the task will continue to run at 1024; which is the problem.
>
> Single case is the easy part
>
> >
> > Running at 640 instead of 512 is still a problem too as this could be 1 or
> > 2 OPP higher than what we want to allow; and there could be significant power
> > difference between those OPPs.
>
> That's my point.
>
> Your proposal tries to fix the simple case of 1 task with a uclamp_max
> but If we want to move in this direction, we should fix all cases.

I think I am addressing all cases. I don't think I understand the problem
you're trying to highlight. Is the RFC patch 4 is what is causing the
confusion? That one is not related to fixing uclamp_max; it's just me
questioning the status quo of which pressure signal requires the headroom.

The main change being done here actually is to apply_dvfs_headroom() *before*
doing uclamp_rq_util_with(). I am not sure how you see this mixing.

Current code performs apply_dvfs_headroom() *after*; which what causes the CPU
to run at a performance level higher than rq->uclamp[UCLAMP_MAX].

It doesn't matter how many tasks on the rq, if rq->uclamp[UCLAMP_MAX] is set to
800, then the CPU should not vote to max (assuminig all other pressures are 0).

Handling of irq pressure etc is not changed.

> I probably need to put my idea in real code to see what it means but
> we should provide 2 value to cpufreq: an utilization level and a
> performance hint which would max aggregate the various performance
> hints that we have

Hmm. This reads to me code structure change. From my perspective; the correct
behavior is to apply the headroom before restricting the performance level.

It seems you're seeing a better way to aggregate all the pressures when taking
uclamp into account. Which I am starting to see your point if this is what
you're saying. Though I haven't changed the existing relationship. I did try
to question some things in patch 4, my thoughts were specific to headroom, but
it seems you have more generalized form.

I do agree in general it seems scheduler and schedutil are getting tightly
coupled code wise and it would be good to provide generic cpufreq interfaces to
potentially allow other governors to hook in and benefit.

Whether this is something desired or not, I don't know as I remember Peter and
Rafael wanted schedutil to grow to become the de-facto governor. It seems the
right thing anyway.


Thanks!

--
Qais Yousef