RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems

From: Doug Smythies
Date: Sun Oct 06 2019 - 10:46:50 EST


On 2019.10.01 02:32 Rafael J. Wysocki wrote:
> On Sun, Sep 29, 2019 at 6:05 PM Doug Smythies <dsmythies@xxxxxxxxx> wrote:
>> On 2019.09.26 09:32 Doug Smythies wrote:
>>
>>> If the deepest idle state is disabled, the system
>>> can become somewhat unstable, with anywhere between no problem
>>> at all, to the occasional temporary jump using a lot more
>>> power for a few seconds, to a permanent jump using a lot more
>>> power continuously. I have been unable to isolate the exact
>>> test load conditions under which this will occur. However,
>>> temporarily disabling and then enabling other idle states
>>> seems to make for a somewhat repeatable test. It is important
>>> to note that the issue occurs with only ever disabling the deepest
>>> idle state, just not reliably.
>>>
>>> I want to know how you want to proceed before I do a bunch of
>>> regression testing.
>>
>> I did some regression testing anyhow, more to create and debug
>> a methodology than anything else.
>>
>>> On 2018.12.11 03:50 Rafael J. Wysocki wrote:
>>>
>>>> v7 -> v8:
>>>> * Apply the selection rules to the idle deepest state as well as to
>>>> the shallower ones (the deepest idle state was treated differently
>>>> before by mistake).
>>>> * Subtract 1/2 of the exit latency from the measured idle duration
>>>> in teo_update() (instead of subtracting the entire exit latency).
>>>> This makes the idle state selection be slightly more performance-
>>>> oriented.
>>>
>>> I have isolated the issue to a subset of the v7 to v8 changes, however
>>> it was not the exit latency changes.
>>>
>>> The partial revert to V7 changes I made were (on top of 5.3):
>>
>> The further testing showed a problem or two with my partial teo-v7 reversion
>> (I call it teo-v12) under slightly different testing conditions.

Correction:
There was no problem with my partial reversion kernel (a.k.a. teo-v12). The problem
was confusion over which kernel I was actually running for a given test.

>>
>> I also have a 5.3 based kernel with the current teo reverted and the entire
>> teo-v7 put in its place. I have yet to find a idle state disabled related issue
>> with this kernel.
>>
>> I'll come back to this thread at a later date with better details and test results.
>
> Thanks for this work!
>
> Please also note that there is a teo patch in 5.4-rc1 that may make a
> difference in principle.

Yes, this saga actually started somewhere between kernel 5.3 and 5.4-rc1,
and did include those teo patches, which significantly increase the
probability of the issue occurring.

When the deepest idle state is disabled and the loop that searches all of the
idle states exits normally, the code after the loop may incorrectly re-evaluate
an idle state that was previously deemed not worth the check. This regression
was introduced between teo development versions 7 and 8. The fix is to move
the code back inside the loop.
(I'll submit a patch in a day or two).

I do not think I stated it clearly before: the problem here is that some CPUs
seem to get stuck in idle state 0, and when they do, power consumption spikes,
often by several hundred percent and often indefinitely.

I made a hack job automated test:

Kernel       tests  fail rate
5.4-rc1       6616     13.45%
5.3           2376      4.50%
5.3-teov7    12136      0.00%  <<< teo.c reverted and teo-v7 put in its place.
5.4-rc1-ds   11168      0.00%  <<< proposed patch (> 7 hours test time)

Proposed patch (on top of kernel 5.4-rc1):

doug@s15:~/temp-k-git/linux$ git diff
diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index b5a0e49..0502aa9 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -276,8 +276,22 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		if (idx < 0)
 			idx = i; /* first enabled state */
 
-		if (s->target_residency > duration_us)
+		if (s->target_residency > duration_us) {
+			/*
+			 * If the "hits" metric of the idle state matching the sleep length is
+			 * greater than its "misses" metric, that is the one to use. Otherwise,
+			 * it is more likely that one of the shallower states will match the
+			 * idle duration observed after wakeup, so take the one with the maximum
+			 * "early hits" metric, but if that cannot be determined, just use the
+			 * state selected so far.
+			 */
+			if (cpu_data->states[idx].hits <= cpu_data->states[idx].misses &&
+			    max_early_idx >= 0) {
+				idx = max_early_idx;
+				duration_us = drv->states[idx].target_residency;
+			}
 			break;
+		}
 
 		if (s->exit_latency > latency_req && constraint_idx > i)
 			constraint_idx = i;
@@ -293,20 +307,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	}
 
 	/*
-	 * If the "hits" metric of the idle state matching the sleep length is
-	 * greater than its "misses" metric, that is the one to use. Otherwise,
-	 * it is more likely that one of the shallower states will match the
-	 * idle duration observed after wakeup, so take the one with the maximum
-	 * "early hits" metric, but if that cannot be determined, just use the
-	 * state selected so far.
-	 */
-	if (cpu_data->states[idx].hits <= cpu_data->states[idx].misses &&
-	    max_early_idx >= 0) {
-		idx = max_early_idx;
-		duration_us = drv->states[idx].target_residency;
-	}
-
-	/*
 	 * If there is a latency constraint, it may be necessary to use a
 	 * shallower idle state than the one selected so far.
 	 */