Re: [PATCH 1/4] sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom

From: Qais Yousef
Date: Sat Aug 26 2023 - 16:04:27 EST


On 08/21/23 18:38, Dietmar Eggemann wrote:
> On 20/08/2023 23:06, Qais Yousef wrote:
> > We are providing headroom for the utilization to grow until the next
> > decision point to pick the next frequency. Give the function a better
> > name and give it some documentation. It is not really mapping anything.
>
> Wasn't the original aim to have a counterpart to task scheduler's
> fits_capacity(), i.e. implement a frequency tipping point at 80%?
>
> #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
>
>
> (util / max) = 0.8, hence 1.25 for the frequency-invariant case?
>
> https://lkml.kernel.org/r/11678919.CQLTrQTYxG@xxxxxxxxxxxxxx
>
> next_freq = 1.25 * max_freq * util / max
>
> 1,25 * util <-- map_util_perf()
>
> [...]
>
> Difference is that EAS deals with `util_cfs` and `capacity` whereas
> power (CPUfreq and EM) deals with `util` and `capacity_orig`. And this
> is where `capacity pressure` comes in for EAS (or fair.c).
>
> In this regard, I'm not sure why we should rename the function?

I don't see any relation between the two to be honest. Both are magical numbers
actually that no longer mean any sense in real world and people modify them in
practice, as you know.

As we brought up the topic in pelt half life and other avenues, this 25% is
a headroom for the util to grow. And this headroom is required for DVFS reasons
as evident by the current definition being in cpufreq.h.

fits_capacity() is about misfit detection - I don't see any relation to
selecting frequency. But voodoo relationship that is based on past
circumstances I don't see holding true as both code and systems have evolved
significantly since then. I think you're trying to make sure we've hit the top
frequency before we force an upmigration early. But this is artificial and not
a real necessity.

Consider for example a single task running on a medium core with a capacity of
750. It has a steady state utilization of 300. Why should it run at 25% higher
OPP which can be one or 2 freq jumps difference? And since the power cost is
not linear, the power difference could be big. Not a cost I'd pay to
artificially coordinate with misfit detection. And how about SMP systems that
don't care about misfit detection? Why should they be tied to this magical 25%
headroom?

Note that DVFS hardware is a lot better nowadays that it is common to see
a reaction time of 500us. So if this 300 task goes through a transition and its
util starts growing again, we'll react to this within 500us; that's barely
a few percents jump compared to the 25% one. I think this is what the headroom
should be about, hence the new name.

I will send the patches to address this issue separately soon enough; they're
almost ready actually. I didn't expect the name of the function to be an issue
here to be honest and thought it is clear a headroom for dvfs reaction time.


Thanks!

--
Qais Yousef