Re: [RFC PATCH 0/1] cpuidle: teo: Introduce optional util-awareness

From: Doug Smythies
Date: Sat Sep 17 2022 - 18:52:26 EST


On Thu, Sep 15, 2022 at 9:45 AM Kajetan Puchalski
<kajetan.puchalski@xxxxxxx> wrote:
>
> Hi,

Hi,

I tried it.

>
> At the moment, all the available idle governors operate mainly based on their own past performance
> without taking into account any scheduling information. Especially on interactive systems, this
> results in them frequently selecting a deeper idle state and then waking up before its target
> residency is hit, thus leading to increased wakeup latency and lower performance with no power
> saving. For 'menu' while web browsing on Android for instance, those types of wakeups ('too deep')
> account for over 24% of all wakeups.
>
> At the same time, on some platforms C0 can be power efficient enough to warrant wanting to prefer
> it over C1. Sleeps that happened in C0 while they could have used C1 ('too shallow') only save
> less power than they otherwise could have. Too deep sleeps, on the other hand, harm performance
> and nullify the potential power saving from using C1 in the first place. While taking this into
> account, it is clear that on balance it is preferable for an idle governor to have more too shallow
> sleeps instead of more too deep sleeps.
>
> Currently the best available governor under this metric is TEO which on average results in less than
> half the percentage of too deep sleeps compared to 'menu', getting much better wakeup latencies and
> increased performance in the process.
>
> This proposed optional extension to TEO would specifically tune it for minimising too deep
> sleeps and minimising latency to achieve better performance. To this end, before selecting the next
> idle state it uses the avg_util signal of a CPU's runqueue in order to determine to what extent the
> CPU is being utilized. This util value is then compared to a threshold defined as a percentage of
> the cpu's capacity (capacity >> 6 ie. ~1.5% in the current implementation).

That seems quite a bit too low to me. However on my processor the
energy cost of using
idle state 0 verses anything deeper is very high, so I do not have a
good way to test.

Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
On an idle system :
with only Idle state 0 enabled, processor package power is ~46 watts.
with only idle state 1 enabled, processor package power is ~2.6 watts
with all idle states enabled, processor package power is ~1.4 watts

> If the util is above the
> threshold, the governor directly selects the shallowest available idle state. If the util is below
> the threshold, the governor defaults to the TEO metrics mechanism to try to select the deepest
> available idle state based on the closest timer event and its own past correctness.
>
> Effectively this functions like a governor that on the fly disables deeper idle states when there
> are things happening on the cpu and then immediately reenables them as soon as the cpu isn't
> being utilized anymore.
>
> Initially I am sending this as a patch for TEO to visualize the proposed mechanism and simplify
> the review process. An alternative way of implementing it while not interfering
> with existing TEO code would be to fork TEO into a separate but mostly identical for the time being
> governor (working name 'idleutil') and then implement util-awareness there, so that the two
> approaches can coexist and both be available at runtime instead of relying on a compile-time option.
> I am happy to send a patchset doing that if you think it's a cleaner approach than doing it this way.

I would prefer the two to coexist for testing, as it makes it easier
to manually compare some
areas of focus.

>
> This approach can outperform all the other currently available governors, at least on mobile device
> workloads, which is why I think it is worth keeping as an option.
>
> Additionally, in my view, the reason why it makes more sense to implement this type of mechanism
> inside a governor rather than outside using something like QoS or some other way to disable certain
> idle states on the fly are the governor's metrics. If we were disabling idle states and reenabling
> them without the governor 'knowing' about it, the governor's metrics would end up being updated
> based on state selections not caused by the governor itself. This could interfere with the
> correctness of said metrics as that's not what they were designed for as far as I understand.
> This approach skips metrics updates whenever a state was selected based on the util and not based
> on the metrics.
>
> There is no particular attachment or reliance on TEO for this mechanism, I simply chose to base
> it on TEO because it performs the best out of all the available options and I didn't think there was
> any point in reinventing the wheel on the side of computing governor metrics. If a
> better approach comes along at some point, there's no reason why the same idle aware mechanism
> couldn't be used with any other metrics algorithm. That would, however, require implemeting it as
> a separate governor rather than a TEO add-on.
>
> As for how the extension performs in practice, below I'll add some benchmark results I got while
> testing this patchset.
>
> Pixel 6 (Android 12, mainline kernel 5.18):
>
> 1. Geekbench 5 (latency-sensitive, heavy load test)
>
> The values below are gmean values across 3 back to back iteration of Geekbench 5.
> As GB5 is a heavy benchmark, after more than 3 iterations intense throttling kicks in on mobile devices
> resulting in skewed benchmark scores, which makes it difficult to collect reliable results. The actual
> values for all of the governors can change between runs as the benchmark might be affected by factors
> other than just latency. Nevertheless, on the runs I've seen, util-aware TEO frequently achieved better
> scores than all the other governors.
>
> 'shallow' is a trivial governor that only ever selects the shallowest available state, included here
> for reference and to establish the lower bound of latency possible to achieve through cpuidle.
>
> 'gmean too deep %' and 'gmean too shallow %' are percentages of too deep and too shallow sleeps
> computed using the new trace event - cpu_idle_miss. The percentage is obtained by counting the two
> types of misses over the course of a run and then dividing them by the total number of wakeups.
>
> | metric | menu | teo | shallow | teo + util-aware |
> | ------------------------------------- | ------------- | --------------- | --------------- | --------------- |
> | gmean score | 2716.4 (0.0%) | 2795 (+2.89%) | 2780.5 (+2.36%) | 2830.8 (+4.21%) |
> | gmean too deep % | 16.64% | 9.61% | 0% | 4.19% |
> | gmean too shallow % | 2.66% | 5.54% | 31.47% | 15.3% |
> | gmean task wakeup latency (gb5) | 82.05μs (0.0%) | 73.97μs (-9.85%) | 42.05μs (-48.76%) | 66.91μs (-18.45%) |
> | gmean task wakeup latency (asynctask) | 75.66μs (0.0%) | 56.58μs (-25.22%) | 65.78μs (-13.06%) | 55.35μs (-26.84%) |
>
> In case of this benchmark, the difference in latency does seem to translate into better scores.
>
> Additionally, here's a set of runs of Geekbench done after holding the phone in
> the fridge for exactly an hour each time in order to minimise the impact of thermal issues.
>
> | metric | menu | teo | teo + util-aware |
> | ------------------------------------- | ------------- | --------------- | --------------- |
> | gmean multicore score | 2792.1 (0.0%) | 2845.2 (+1.9%) | 2857.4 (+2.34%) |
> | gmean single-core score | 1048.3 (0.0%) | 1052.6 (+0.41%) | 1055.3 (+0.67%) |
>
> 2. PCMark Web Browsing (non latency-sensitive, normal usage test)
>
> The table below contains gmean values across 20 back to back iterations of PCMark 2 Web Browsing.
>
> | metric | menu | teo | shallow | teo + util-aware |
> | ------------------------- | ------------- | --------------- | --------------- | --------------- |
> | gmean score | 6283.0 (0.0%) | 6262.9 (-0.32%) | 6258.4 (-0.39%) | 6323.7 (+0.65%) |
> | gmean too deep % | 24.15% | 10.32% | 0% | 3.2% |
> | gmean too shallow % | 2.81% | 7.68% | 27.69% | 17.189% |
> | gmean power usage [mW] | 209.1 (0.0%) | 187.8 (-10.17%) | 205.5 (-1.71%) | 205 (-1.96%) |
> | gmean task wakeup latency | 204.6μs (0.0%) | 184.39μs (-9.87%) | 95.55μs (-53.3%) | 95.98μs (-53.09%) |
>
> As this is a web browsing benchmark, the task for which the wakeup latency was recorded was Chrome's
> rendering task, ie CrRendererMain. The latency improvement for the actual benchmark task was very
> similar.
>
> In this case the large latency improvement does not translate into a notable increase in benchmark score as
> this particular benchmark mainly responds to changes in operating frequency. Nevertheless, the small power
> saving compared to menu with no decrease in benchmark score indicate that there are no regressions for this
> type of workload while using this governor.
>
> Note: The results above were as mentioned obtained on the 5.18 kernel. Results for Geekbench obtained after
> backporting CFS patches from the most recent mainline can be found in the pdf linked below [1].
> The results and improvements still hold up but the numbers change slightly. Additionally, the pdf contains
> plots for all the relevant results obtained with this and other idle governors.
>
> At the very least this approach seems promising so I wanted to discuss it in RFC form first.
> Thank you for taking your time to read this!

There might be a way forward for my type of processor if the algorithm
were to just reduce the idle
depth by 1 instead of all the way to idle state 0. Not sure. It seems
to bypass all that the teo
governor is attempting to achieve.

For a single periodic workflow at any work sleep frequency (well, I
test 5 hertz to 411 hertz) and very
light workload: Processor package powers for 73 hertz work/sleep frequency:

teo: ~1.5 watts
menu: ~1.5 watts
util: ~19 watts

For 12 periodic workflow threads at 73 hertz work/sleep frequency
(well, I test 5 hertz to 411 hertz) and very
workload: Processor package powers:

teo: ~2.8watts
menu: ~2.8 watts
util: ~49 watts

My test computer is a server, with no gui. I started a desktop linux
VM guest that isn't doing much:

teo: ~1.8 watts
menu: ~1.8 watts
util: ~7.8 watts

>
> --
> Kajetan
>
> [1] https://github.com/mrkajetanp/lisa-notebooks/blob/a2361a5b647629bfbfc676b942c8e6498fb9bd03/idle_util_aware.pdf
>
>
> Kajetan Puchalski (1):
> cpuidle: teo: Introduce optional util-awareness
>
> drivers/cpuidle/Kconfig | 12 +++++
> drivers/cpuidle/governors/teo.c | 86 +++++++++++++++++++++++++++++++++
> 2 files changed, 98 insertions(+)
>
> --
> 2.37.1
>