Re: [PATCH 2/2] thermal: cpufreq_cooling: Reuse effective_cpu_util()

From: Lukasz Luba
Date: Fri Jul 17 2020 - 05:55:48 EST

Next message: Greg KH: "Re: [git pull] habanalabs fixes pull request for kernel 5.8-rc4/5"
Previous message: Lad, Prabhakar: "Re: [PATCH] media: isif: reset global state"
In reply to: Peter Zijlstra: "Re: [PATCH 2/2] thermal: cpufreq_cooling: Reuse effective_cpu_util()"
Next in thread: Vincent Guittot: "Re: [PATCH 2/2] thermal: cpufreq_cooling: Reuse effective_cpu_util()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 7/16/20 4:43 PM, Peter Zijlstra wrote:

On Thu, Jul 16, 2020 at 03:24:37PM +0100, Lukasz Luba wrote:

On 7/16/20 12:56 PM, Peter Zijlstra wrote:

The second attempts to guesstimate power, and is the subject of this
patch.

Currently cpufreq_cooling appears to estimate the CPU energy usage by
calculating the percentage of idle time using the per-cpu cpustat stuff,
which is pretty horrific.

Even worse, it then *samples* the *current* CPU frequency at that
particular point in time and assumes that when the CPU wasn't idle
during that period - it had *this* frequency...

*whee* :-)
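
As a rough illustration of the estimate criticised above, here is a minimal Python sketch. The function names, the frequency-to-power table and the sample numbers are all hypothetical; it only mirrors the idea of scaling the power of one sampled frequency by a busy percentage derived from idle-time counters, and is not the actual cpufreq_cooling code.

```python
def busy_percent(idle_delta_us, window_us):
    """Percentage of the window the CPU was not idle (idle-time based)."""
    if window_us <= 0:
        return 0.0
    idle = min(max(idle_delta_us, 0), window_us)
    return 100.0 * (window_us - idle) / window_us


def sampled_power_mw(freq_to_power_mw, sampled_freq_khz, load_percent):
    """Scale the power of the *sampled* frequency by the busy percentage.

    This assumes the CPU ran at sampled_freq_khz whenever it was busy,
    which is exactly the weak point discussed above.
    """
    return freq_to_power_mw[sampled_freq_khz] * load_percent / 100.0


# Hypothetical table: frequency (kHz) -> power (mW)
table = {500_000: 100.0, 1_000_000: 400.0}
load = busy_percent(idle_delta_us=75_000, window_us=100_000)
estimate = sampled_power_mw(table, 1_000_000, load)
print(load, estimate)
```

If the CPU actually spent most of its busy time at the lower frequency, this estimate would be too high, since only the frequency seen at sampling time is used.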

...

In EM we keep power values in the array and these values grow
exponentially. Each OPP has its corresponding power:

P_x = C * (V_x)^2 * f_x, where x is the OPP id and V_x, f_x are the
corresponding voltage and frequency

so we have discrete power values, growing like:

  ^ (power)
  |
  |
  |                         *
  |
  |
  |                   *
  |                   |
  |             *     |
  |                   | <----- power estimation function
  |       *           |        should not use linear 'util/max_util'
  | *                 |        relation here [*]
  |___________________|_____________> (freq)
   opp0  opp1  opp2  opp3  opp4
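
To make the shape of these discrete power values concrete, here is a small Python sketch that evaluates P = C * V^2 * f for a handful of OPPs. The voltage/frequency pairs and the capacitance constant are invented for illustration and do not come from any real platform or from the EM code.

```python
def opp_power(capacitance, volt, freq_hz):
    """Dynamic power for one OPP: P = C * V^2 * f."""
    return capacitance * volt ** 2 * freq_hz


# Hypothetical (voltage in V, frequency in Hz) pairs, lowest OPP first.
OPPS = [(0.70, 0.5e9), (0.80, 0.8e9), (0.90, 1.1e9), (1.00, 1.4e9), (1.10, 1.8e9)]
C_EFF = 1.0e-9  # hypothetical effective capacitance, in farads

powers = [opp_power(C_EFF, v, f) for v, f in OPPS]
for i, p in enumerate(powers):
    print(f"opp{i}: {p:.3f} W")
```

Because the voltage rises together with the frequency, the power grows faster than linearly with frequency across the table.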

What is the problem:
First:
We need to pick the right Power from the array. I would suggest
to pick the max allowed frequency for that whole period, because
we don't know if the CPUs were using it (it's likely).
Second:
Then we have the utilization, which can be considered as:
'idle period & running period with various freq inside', let's
call it avg performance in that whole period.
Third:
Try to estimate the power used in that whole period having
the avg performance and max performance.

What you are suggesting is to travel along that [*] line not
linearly, but as (util^3)/(max_util^3), which means it goes down
faster when the utilization drops.
I think it is too aggressive, e.g.
500^3 / 1024^3 = 0.116 <--- very little, ~12%
200^3 / 300^3 = 0.296
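
The two ratios above can be checked, and compared with the linear 'util/max_util' scaling, with a short Python sketch. The helper names are made up for this example.

```python
def linear_scale(util, max_util):
    """Linear scaling factor: util / max_util."""
    return util / max_util


def cubic_scale(util, max_util):
    """Cubic scaling factor: (util / max_util)^3."""
    return (util / max_util) ** 3


results = {}
for util, max_util in [(500, 1024), (200, 300)]:
    lin = round(linear_scale(util, max_util), 3)
    cub = round(cubic_scale(util, max_util), 3)
    results[(util, max_util)] = (lin, cub)
    print(f"{util}/{max_util}: linear={lin} cubic={cub}")
```

For 500/1024 the linear factor is about 0.488 while the cubic one is about 0.116, which is the gap the "too aggressive" remark refers to.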

Peter, could you confirm whether I understood you correctly?

Correct, with the caveat that we might try and regression fit a 3rd
order polynomial to a bunch of EM data to see if there's a 'better'
function to be had than a raw 'f(x) := x^3'.
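
One way such a regression fit could be tried is sketched below in plain Python: a least-squares fit of c0 + c1*x + c2*x^2 + c3*x^3 via the normal equations. The sample points are synthetic (generated from a known cubic, not taken from any real EM table), and the function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular; no error handling for that case.
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_cubic(xs, ys):
    """Least-squares coefficients [c0, c1, c2, c3] of a cubic in x."""
    deg = 3
    # Normal equations (V^T V) c = V^T y, with V the Vandermonde matrix.
    vtv = [[sum(x ** (i + j) for x in xs) for j in range(deg + 1)]
           for i in range(deg + 1)]
    vty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(deg + 1)]
    return solve_linear(vtv, vty)


# Synthetic samples: normalized frequency -> normalized power.
freqs = [0.2, 0.4, 0.6, 0.8, 1.0]
powers = [0.05 + 0.10 * f + 0.85 * f ** 3 for f in freqs]
coeffs = fit_cubic(freqs, powers)
print([round(c, 6) for c in coeffs])
```

With real EM data the fitted coefficients would show how far the table deviates from a raw x^3 curve; normalizing the inputs to [0, 1] keeps the normal equations reasonably well conditioned.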

I agree, I think we are on the same wavelength now.

This is quite an important bit for me.

So, if we assume schedutil + EM, we can actually have schedutil
calculate a running power sum. That is, something like: \Int P_x dt.
Because we know the points where OPP changes.
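
A running power sum of this kind can be sketched in Python as a piecewise-constant integral over OPP change events. The event format, the power table and the function name are assumptions made for this example; idle time is deliberately not modeled here.

```python
def energy_from_opp_changes(events, end_time_s, power_w):
    """Integrate power over time given (timestamp_s, opp_index) events.

    events must be sorted by time; each OPP is assumed to stay in effect
    until the next event (or until end_time_s). Idle time is not modeled.
    """
    energy_j = 0.0
    boundaries = events[1:] + [(end_time_s, None)]
    for (t, opp), (t_next, _) in zip(events, boundaries):
        energy_j += power_w[opp] * (t_next - t)
    return energy_j


# Hypothetical per-OPP power in watts, indexed by OPP id.
power_w = [0.25, 0.5, 1.0, 1.5, 2.0]
# Hypothetical OPP changes within a 1 second window.
events = [(0.0, 0), (0.25, 3), (0.75, 1)]

energy = energy_from_opp_changes(events, end_time_s=1.0, power_w=power_w)
avg_power = energy / 1.0
print(energy, avg_power)
```

Dividing the accumulated energy by the window length gives an average power for that period, without sampling a single frequency.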

Yes, that's why I was thinking about having this information stored as a
copy inside the EM, then just read it in other subsystems such as
thermal, powercap, etc.

Although, thinking more, I suspect we need tighter integration with
cpuidle, because we don't actually have idle times here, but that should
be doable.

I have been scratching my head for a while because of that idle issue.
It opens more dimensions to tackle.

But for anything other than schedutil + EM, things become more
interesting, because then we need to guesstimate power usage without the
benefit of having actual power numbers.

Yes, from the engineering/research perspective, platforms which do not
have EM in Linux (like Intel) are also interesting.

We can of course still do that running power sum, with whatever P(u) or
P(f) end up with, I suppose.

Another point is that cpu_util() vs turbo is a bit iffy, and to that,
things like x86-APERF/MPERF and ARM-AMU got mentioned. Those might also
have the benefit of giving you values that match your own sampling
interval (100ms), where the sched stuff is PELT (64,32.. based).

So what I've been thinking is that cpufreq drivers ought to be able to
supply this method, and only when they lack one can the cpufreq-governor
(schedutil) install a fallback. And then cpufreq-cooling can use
whatever is provided (through the cpufreq interfaces).

That way, we:

1) don't have to export anything
2) get arch drivers to provide something 'better'

Does that sound like something sensible?

Yes, makes sense. Please also keep in mind that this
utilization must somehow be mapped into power in a proper way.
I am currently working on addressing all of these problems
(including this correlation).

Right, so that mapping util to power was what I was missing and
suggesting we do. So for 'simple' hardware we have cpufreq events for
frequency change, and cpuidle events for idle, and with EM we can simply
sum the relevant power numbers.

For hardware lacking EM, or hardware managed DVFS, we'll have to fudge
things a little. How best to do that is up in the air a little, but
virtual power curves seem a useful tool to me.

The next problem for IPA is having all the devices report power in the
same virtual unit I suppose, but I'll leave that to others ;-)

True, there are more issues. There is also another movement with powercap
driven by Daniel Lezcano, which I am going to support. Maybe he would
be interested as well in having a copy of calculated energy stored
in EM. I must gather some requirements and align with him.

Thank you for your support!

Regards,
Lukasz
