Re: [PATCH 2/2] thermal: cpufreq_cooling: Reuse effective_cpu_util()

From: Peter Zijlstra
Date: Thu Jul 16 2020 - 11:43:54 EST


On Thu, Jul 16, 2020 at 03:24:37PM +0100, Lukasz Luba wrote:
> On 7/16/20 12:56 PM, Peter Zijlstra wrote:

> > The second attempts to guesstimate power, and is the subject of this
> > patch.
> >
> > Currently cpufreq_cooling appears to estimate the CPU energy usage by
> > calculating the percentage of idle time using the per-cpu cpustat stuff,
> > which is pretty horrific.
>
> Even worse, it then *samples* the *current* CPU frequency at that
> particular point in time and assumes that when the CPU wasn't idle
> during that period - it had *this* frequency...

*whee* :-)

...

> In EM we keep power values in the array and these values grow
> exponentially. Each OPP has it corresponding
>
> P_x = C (V_x)^2 f_x , where x is the OPP id thus corresponding V,f
>
> so we have discrete power values, growing like:
>
> ^(power)
> |
> |
> | *
> |
> |
> | *
> | |
> | * |
> | | <----- power estimation function
> | * | should not use linear 'util/max_util'
> | * | relation here *
> |_______________________|_____________> (freq)
> opp0 opp1 opp2 opp3 opp4
>
> What is the problem
> First:
> We need to pick the right Power from the array. I would suggest
> to pick the max allowed frequency for that whole period, because
> we don't know if the CPUs were using it (it's likely).
> Second:
> Then we have the utilization, which can be considered as:
> 'idle period & running period with various freq inside', lets
> call it avg performance in that whole period.
> Third:
> Try to estimate the power used in that whole period having
> the avg performance and max performance.
>
> What you are suggesting is to travel that [*] line in
> non-linear fashion, but in (util^3)/(max_util^3). Which means
> it goes down faster when the utilization drops.
> I think it is too aggressive, e.g.
> 500^3 / 1024^3 = 0.116 <--- very little, ~12%
> 200^3 / 300^3 = 0.296
>
> Peter could you confirm if I understood you correct?

Correct, with the caveat that we might try and regression fit a 3rd
order polynomial to a bunch of EM data to see if there's a 'better'
function to be had than a raw 'f(x) := x^3'.

> This is quite important bit for me.

So, if we assume schedutil + EM, we can actually have schedutil
calculate a running power sum. That is, something like: \Int P_x dt.
Because we know the points where OPP changes.

Although, thinking more, I suspect we need tighter integration with
cpuidle, because we don't actually have idle times here, but that should
be doable.

But for anything other than schedutil + EM, things become more
interesting, because then we need to guesstimate power usage without the
benefit of having actual power numbers.

We can of course still do that running power sum, with whatever P(u) or
P(f) end up with, I suppose.

> > Another point is that cpu_util() vs turbo is a bit iffy, and to that,
> > things like x86-APERF/MPERF and ARM-AMU got mentioned. Those might also
> > have the benefit of giving you values that match your own sampling
> > interval (100ms), where the sched stuff is PELT (64,32.. based).
> >
> > So what I've been thinking is that cpufreq drivers ought to be able to
> > supply this method, and only when they lack, can the cpufreq-governor
> > (schedutil) install a fallback. And then cpufreq-cooling can use
> > whatever is provided (through the cpufreq interfaces).
> >
> > That way, we:
> >
> > 1) don't have to export anything
> > 2) get arch drivers to provide something 'better'
> >
> >
> > Does that sounds like something sensible?
> >
>
> Yes, make sense. Please also keep in mind that this
> utilization somehow must be mapped into power in a proper way.
> I am currently working on addressing all of these problems
> (including this correlation).

Right, so that mapping util to power was what I was missing and
suggesting we do. So for 'simple' hardware we have cpufreq events for
frequency change, and cpuidle events for idle, and with EM we can simply
sum the relevant power numbers.

For hardware lacking EM, or hardware managed DVFS, we'll have to fudge
things a little. How best to do that is up in the air a little, but
virtual power curves seem a useful tool to me.

The next problem for IPA is having all the devices report power in the
same virtual unit I suppose, but I'll leave that to others ;-)