Re: [PATCH 2/2] thermal: cpufreq_cooling: Reuse effective_cpu_util()
From: Viresh Kumar
Date: Thu Oct 22 2020 - 04:33:03 EST
Hi Peter,
Since Lukasz asked me to hold on to this stuff so he can propose
something in its place, I stayed away from discussing this patchset
for sometime. But now that he agrees [1] that we may take this forward
and he can work on top of it as and when he can, I am looking to find
the way out to get this stuff merged in some form.
On 16-07-20, 13:56, Peter Zijlstra wrote:
> So there's a number of things... let me recap a bunch of things that
> got mentioned on IRC earlier this week and then continue from there..
>
> So IPA* (or any other thermal governor) needs energy estimates for the
> various managed devices, cpufreq_cooling, being the driver for the CPU
> device, needs to provide that and in return receives feedback on how
> much energy it is allowed to consume, cpufreq_cooling then dynamically
> enables/disables OPP states.
>
> There are actually two methods the thermal governor will use:
> get_real_power() and get_requested_power().
>
> The first isn't used anywhere in mainline, but could be implemented on
> hardware that has energy counters (like say x86 RAPL).
>
> The second attempts to guesstimate power, and is the subject of this
> patch.
>
> Currently cpufreq_cooling appears to estimate the CPU energy usage by
> calculating the percentage of idle time using the per-cpu cpustat stuff,
> which is pretty horrific.
>
> This patch then attempts to improve upon that by using the scheduler's
> cpu_util(ENERGY_UTIL) estimate, which is also used to select OPP state
> and improves upon avg idle. This should be a big improvement as higher
> frequency consumes more energy
Exactly and that's the motivation behind this change.
> , but should we not also consider that:
>
> E = C V^2 f
>
> The EAS energy model has tables for the OPPs that contain this, but in
> this case we seem to be assuming a linear enery/frequency curve, which
> is just not the case.
>
> I suppose we could do something like **:
>
> 100 * util^3 / max^3
>
> which assumes V~f.
>
> Another point is that cpu_util() vs turbo is a bit iffy, and to that,
> things like x86-APERF/MPERF and ARM-AMU got mentioned. Those might also
> have the benefit of giving you values that match your own sampling
> interval (100ms), where the sched stuff is PELT (64,32.. based).
I believe the above stuff is more around additional improvements that
we can do over this change, and probably Lukasz was looking to do
that.
> So what I've been thinking is that cpufreq drivers ought to be able to
> supply this method, and only when they lack, can the cpufreq-governor
> (schedutil) install a fallback.
One of the issues I see with this is that schedutil may not be
available in all configurations and it is still absolutely fine to
using the suggested helper to get the energy numbers in such cases, so
we shouldn't really make it scheutil dependent.
> And then cpufreq-cooling can use
> whatever is provided (through the cpufreq interfaces).
>
> That way, we:
>
> 1) don't have to export anything
Yeah, I understand that the exports are annoying and specially this
line :)
+#include "../../kernel/sched/sched.h"
But this is the best I could think of then as we don't want to export
them for anyone else to use sched specific stuff. And there was other
core kernel stuff that was already doing it :)
> 2) get arch drivers to provide something 'better'
>
>
> Does that sounds like something sensible?
--
viresh
[1] http://lore.kernel.org/lkml/d2a75b18-1eae-f396-4dc5-af41b539e581@xxxxxxx