Re: [RFC PATCH 0/1] sched/pelt: Change PELT halflife at runtime

From: Qais Yousef
Date: Mon Apr 03 2023 - 10:46:14 EST


Hi Diemtar

On 03/23/23 17:29, Dietmar Eggemann wrote:
> On 01/03/2023 18:24, Qais Yousef wrote:
> > On 03/01/23 11:39, Vincent Guittot wrote:
> >> On Thu, 23 Feb 2023 at 16:37, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> >>>
> >>> On 02/09/23 17:16, Vincent Guittot wrote:
> >>>
> >>>> I don't see how util_est_faster can help this 1ms task here ? It's
> >>>> most probably never be preempted during this 1ms. For such an Android
> >>>> Graphics Pipeline short task, hasn't uclamp_min been designed for and
> >>>> a better solution ?
> >>>
> >>> uclamp_min is being used in UI and helping there. But your mileage might vary
> >>> with adoption still.
> >>>
> >>> The major motivation behind this is to help things like gaming as the original
> >>> thread started. It can help UI and other use cases too. Android framework has
> >>> a lot of context on the type of workload that can help it make a decision when
> >>> this helps. And OEMs can have the chance to tune and apply based on the
> >>> characteristics of their device.
> >>>
> >>>> IIUC how util_est_faster works, it removes the waiting time when
> >>>> sharing cpu time with other tasks. So as long as there is no (runnable
> >>>> but not running time), the result is the same as current util_est.
> >>>> util_est_faster makes a difference only when the task alternates
> >>>> between runnable and running slices.
> >>>> Have you considered using runnable_avg metrics in the increase of cpu
> >>>> freq ? This takes into the runnable slice and not only the running
> >>>> time and increase faster than util_avg when tasks compete for the same
> >>>> CPU
> >>>
> >>> Just to understand why we're heading into this direction now.
> >>>
> >>> AFAIU the desired outcome to have faster rampup time (and on HMP faster up
> >>> migration) which both are tied to utilization signal.
> >>>
> >>> Wouldn't make the util response time faster help not just for rampup, but
> >>> rampdown too?
> >>>
> >>> If we improve util response time, couldn't this mean we can remove util_est or
> >>> am I missing something?
> >>
> >> not sure because you still have a ramping step whereas util_est
> >> directly gives you the final tager
>
> util_est gives us instantaneous signal at enqueue for periodic tasks,

How do you define instantaneous and periodic here? How would you describe the
behavior for non periodic tasks?

> something PELT will never be able to do.

Why? Isn't by selecting a lower pelt halflife we achieve something similar?

>
> > I didn't get you. tager?
> >
> >>
> >>>
> >>> Currently we have util response which is tweaked by util_est and then that is
> >>> tweaked further by schedutil with that 25% margin when maping util to
> >>> frequency.
> >>
> >> the 25% is not related to the ramping time but to the fact that you
> >> always need some margin to cover unexpected events and estimation
> >> error
> >
> > At the moment we have
> >
> > util_avg -> util_est -> (util_est_faster) -> util_map_freq -> schedutil filter ==> current frequency selection
> >
> > I think we have too many transformations before deciding the current
> > frequencies. Which makes it hard to tweak the system response.
>
> To me it looks more like this:
>
> max(max(util_avg, util_est), runnable_avg) -> schedutil's rate limit* -> freq. selection
> ^^^^^^^^^^^^
> new proposal to factor in root cfs_rq contention

These are still 5 stages even if written differently.

What if background tasks that are causing the contention? How can you tell it
to ignore that and NOT drive the frequency up unnecessary for those non
important ones? If userspace is fully aware of uclamp - this whole discussion
wouldn't be necessary. And I still have a bunch of fixes to push before
uclamp_max is actually usable in production.

> Like Vincent mentioned, util_map_freq() (now: map_util_perf()) is only
> there to create the safety margin used by schedutil & EAS.

Yes I know and that's not the point. The point is that it's a chain reaction.
25% percent headroom is already very aggressive and causes issues on the top
inefficient ends of the cores. And when util is high, you might end up in
a situation where you skip frequencies. Making everything go up faster without
balancing it with either enabling going down faster too or tune this value can
lead to power and thermal issues on powerful systems.

I think all we need is controlling pelt halflife and this one to tune the
system to the desired trade-off.

>
> * The schedutil up/down filter thing has been already naked in Nov 2016.
> IMHO, this is where util_est was initially discussed as an alternative.

Well, I don't see anyone not using a down filter. So I'm not sure util_est has
been a true alternative.

> We have it in mainline as well, but one value (default 10ms) for both
> directions. There was discussion to map it to the driver's
> translation_latency instead.

Which can be filled wrong sometimes :(

>
> In Pixel7 you use 0.5ms up and `5/20/20ms` down for `little/medium/big`.
>
> So on `up` your rate is as small as possible (only respecting the
> driver's translation_latency) but on `down` you use much more than that.
>
> Why exactly do you have this higher value on `down`? My hunch is
> scenarios in which the CPU (all CPUs in the freq. domain) goes idle,
> so util_est is 0 and the blocked utilization is decaying (too fast,
> 4ms (250Hz) versus 20ms?). So you don't want to ramp-up frequency
> again when the CPU wakes up in those 20ms?

The down filter prevents changing the frequency to a lower value. So it's
a holding function to keep the residency at a higher frequency for at least
20ms. It is, sort of, similar to the max() functions you used above. The max
function will allow you to follow the fasting ramping up signal on the way up,
and the slowest ramping down one on the way down.

I think this is more deterministic way to do it.

>
> >>> I think if we can allow improving general util response time by tweaking PELT
> >>> HALFLIFE we can potentially remove util_est and potentially that magic 25%
> >>> margin too.
> >>>
> >>> Why the approach of further tweaking util_est is better?
> >>
> >> note that in this case it doesn't really tweak util_est but Dietmar
> >> has taken into account runnable_avg to increase the freq in case of
> >> contention
> >>
> >> Also IIUC Dietmar's results, the problem seems more linked to the
> >> selection of a higher freq than increasing the utilization;
> >> runnable_avg tests give similar perf results than shorter half life
> >> and better power consumption.
> >
> > Does it ramp down faster too?
>
> Not sure why you are interested in this? Can't be related to the
> `driving DVFS` functionality discussed above.

If you change the reaction time to be more aggressive in going up, then it's
only natural to have it symmetrical so your residency on the power hungry OPPs
don't go over the roof and end up with thermal and power issues.

I am concerned about us biasing towrads perf first too much and not enabling
sys admins to select a proper trade off for their system and use case. Which
are not static. The workloads the system needs to accommodate to are abundant
and operating conditions could change. And the diversity of hardware available
out there is huge - I am not sure how can we expect we can have one response to
accommodate for all of them.

What I'm trying to push for here is that we should look at the chain as one
unit. And we should consider that there's important trade-off to be had here;
having a sensible default doesn't mean the user shouldn't be allowed to select
a different trade-off. I'm not sure the problem can be generalized and fixed
automatically. But happy to be proven wrong of course :-)

FWIW, I'm trying to tweak all these knobs and study their impact. Do you mind
pasting the patch for load_avg consideration so I can take it into account too
in my experiments?


Thanks!

--
Qais Yousef