RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems

From: Doug Smythies
Date: Thu Sep 26 2019 - 12:31:47 EST


All,

Recall the extensive tests and work done between 2018.10.11 and 2019.01.16
on Rafael's teo governor, versions 1 through 11, noting that versions
9, 10, and 11 did not contain functional changes.

Hi Rafael,

If the deepest idle state is disabled, the system
can become somewhat unstable, with anywhere between no problem
at all, to the occasional temporary jump using a lot more
power for a few seconds, to a permanent jump using a lot more
power continuously. I have been unable to isolate the exact
test load conditions under which this will occur. However,
temporarily disabling and then enabling other idle states
seems to make for a somewhat repeatable test. It is important
to note that the issue occurs with only ever disabling the deepest
idle state, just not reliably.

I want to know how you want to proceed before I do a bunch of
regression testing.

On 2018.12.11 03:50 Rafael J. Wysocki wrote:

> v7 -> v8:
> * Apply the selection rules to the idle deepest state as well as to
> the shallower ones (the deepest idle state was treated differently
> before by mistake).
> * Subtract 1/2 of the exit latency from the measured idle duration
> in teo_update() (instead of subtracting the entire exit latency).
> This makes the idle state selection be slightly more performance-
> oriented.

I have isolated the issue to a subset of the v7 to v8 changes, however
it was not the exit latency changes.

The partial revert to V7 changes I made were (on top of 5.3):

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index 7d05efd..d2892bb 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -283,9 +283,28 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
if (idx < 0)
idx = i; /* first enabled state */

- if (s->target_residency > duration_us)
- break;
+ if (s->target_residency > duration_us) {
+ /*
+ * If the "hits" metric of the state matching the sleep
+ * length is greater than its "misses" metric, that is
+ * the one to use.
+ */
+ if (cpu_data->states[idx].hits > cpu_data->states[idx].misses)
+ break;

+ /*
+ * It is more likely that one of the shallower states
+ * will match the idle duration measured after wakeup,
+ * so take the one with the maximum "early hits" metric,
+ * but if that cannot be determined, just use the state
+ * selected so far.
+ */
+ if (max_early_idx >= 0) {
+ idx = max_early_idx;
+ duration_us = drv->states[idx].target_residency;
+ }
+ break;
+ }
if (s->exit_latency > latency_req) {
/*
* If we break out of the loop for latency reasons, use
@@ -294,7 +313,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
* as long as that target residency is low enough.
*/
duration_us = drv->states[idx].target_residency;
- goto refine;
+ break;
}

idx = i;
@@ -307,21 +326,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
}
}

- /*
- * If the "hits" metric of the idle state matching the sleep length is
- * greater than its "misses" metric, that is the one to use. Otherwise,
- * it is more likely that one of the shallower states will match the
- * idle duration observed after wakeup, so take the one with the maximum
- * "early hits" metric, but if that cannot be determined, just use the
- * state selected so far.
- */
- if (cpu_data->states[idx].hits <= cpu_data->states[idx].misses &&
- max_early_idx >= 0) {
- idx = max_early_idx;
- duration_us = drv->states[idx].target_residency;
- }
-
-refine:
if (idx < 0) {
idx = 0; /* No states enabled. Must use 0. */
} else if (idx > 0) {

Test results (note: I use my own monitoring utility, because it feeds my
graphing/html utility, but have used turbostat here in case others
want to repeat the test):

Processor: i7-2600K
Deepest idle state: 4 (C6)
The system is mostly idle for these tests.
turbostat command used:

sudo turbostat --Summary --quiet --hide
IRQ,Avg_MHz,SMI,GFXMHz,TSC_MHz,GFXWatt,CorWatt,CPU%c1,CPU%c7,CoreTmp,GFX%rc6,Pkg%pc2,Pkg%pc3,Pkg%pc6,C1,C1E,,C1%,C1E%,C3%,C6%
--interval 3

Only because it works so reliably, the test involves disabling all idle states
deeper than 0, then enabling all idle states shallower than the deepest.

I have attempted to minimize the number of display columns,
hopefully without hiding useful information.

Kernel 5.3:

Busy% Bzy_MHz POLL C3 C6 POLL% CPU%c3 CPU%c6 PkgTmp PkgWatt
0.04 1600 0 0 232 0.00 0.00 99.85 28 3.73
0.05 1600 0 3 254 0.00 0.00 99.82 28 3.73
0.04 1600 0 1 264 0.00 0.00 99.82 30 3.73
61.08 3493 14735 16 258 60.84 0.04 38.57 45 31.00
100.00 3500 24057 0 0 99.88 0.00 0.00 46 48.73
100.00 3500 24064 0 0 99.86 0.00 0.00 48 48.87
100.00 3500 24057 0 0 99.88 0.00 0.00 49 49.07
100.00 3500 24070 0 0 99.87 0.00 0.00 50 49.21
100.00 3500 24068 0 0 99.87 0.00 0.00 50 49.29
100.00 3500 24063 0 0 99.88 0.00 0.00 52 49.42
100.00 3500 24060 0 0 99.87 0.00 0.00 53 49.58
100.00 3500 24076 0 0 99.87 0.00 0.00 53 49.70
100.00 3500 24067 0 0 99.87 0.00 0.00 55 49.77
28.89 3575 15138 261 0 28.71 59.97 0.00 48 27.93
10.52 3666 11884 379 0 10.43 75.24 0.00 47 20.72
10.80 3680 12141 558 0 10.65 75.13 0.00 46 21.11
11.51 3742 13016 217 0 11.43 74.93 0.00 47 22.48
10.84 3696 12254 294 0 10.76 74.89 0.00 48 21.26
10.34 3652 11683 243 0 10.25 74.93 0.00 47 20.36
10.67 3681 12065 240 0 10.59 74.94 0.00 48 20.96
10.68 3671 12067 278 0 10.59 74.92 0.00 47 20.85
10.84 3690 12259 236 0 10.76 74.94 0.00 47 21.24
11.51 3735 13023 265 0 11.43 74.93 0.00 48 22.43

Notice how once idle state 3 is enabled again (as were states 1 and 2) and we
should be at about 5 watts, there is still a ton of time spent in idle state
0 (POLL).

Kernel 5.3 + above patch:

Busy% Bzy_MHz POLL C3 C6 POLL% CPU%c3 CPU%c6 PkgTmp PkgWatt
0.04 1600 0 3 235 0.00 0.00 99.82 26 3.71
0.04 1600 0 1 252 0.00 0.00 99.79 26 3.71
0.05 1600 0 3 273 0.00 0.00 99.79 26 3.72
0.04 1600 0 3 224 0.00 0.00 99.83 26 3.71
0.05 1600 0 1 252 0.00 0.00 99.81 25 3.71
0.04 1600 0 1 231 0.00 0.00 99.84 26 3.71
0.05 1600 0 3 294 0.00 0.00 99.77 28 3.71
0.04 1600 0 2 259 0.00 0.00 99.81 25 3.70
69.61 3490 16773 15 326 69.33 0.14 29.98 42 34.41
100.00 3500 24079 0 0 99.88 0.00 0.00 45 48.27
100.00 3500 24062 0 0 99.89 0.00 0.00 45 48.41
100.00 3500 24056 0 0 99.88 0.00 0.00 46 48.52
100.00 3500 24064 0 0 99.89 0.00 0.00 47 48.63
27.31 3498 6588 340 0 27.16 72.40 0.00 32 17.13
0.03 1600 1 230 0 0.00 99.91 0.00 30 5.08
0.03 1600 2 234 0 0.00 99.90 0.00 31 5.07
0.03 1600 5 238 0 0.00 99.91 0.00 30 5.07
0.03 1600 3 230 0 0.00 99.91 0.00 30 5.05
0.03 1600 3 246 0 0.00 99.91 0.00 29 5.05
0.03 1600 3 235 0 0.00 99.90 0.00 30 5.04

Notice how once idle state 3 is enabled again (as were states 1 and 2) almost
all time is spent there, as expected, and the power is the expected 5 watts.

... Doug