Re: [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path
From: Christian Loehle
Date: Fri Jan 10 2025 - 09:54:29 EST
On 1/10/25 13:34, Rafael J. Wysocki wrote:
> On Fri, Jan 10, 2025 at 2:16 PM Christian Loehle
> <christian.loehle@xxxxxxx> wrote:
>>
>> On 1/10/25 12:53, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
>>>
>>> After commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length()
>>> call in some cases") the teo governor behaves a bit differently on
>>> systems where idle state 0 is a "polling" state (that is, it is not
>>> really an idle state, but a loop continuously executed by the CPU).
>>> Namely, on such systems it skips the tick_nohz_get_sleep_length() call
>>> if the target residency of the current candidate idle state is small
>>> enough.
>>>
>>> However, if state 0 itself was to be returned, it would be returned
>>> right away without calling tick_nohz_get_sleep_length() even on systems
>>> where it was not a "polling" state until commit 4b20b07ce72f ("cpuidle:
>>> teo: Don't count non-existent intercepts") that attempted to fix this
>>> problem.
>>>
>>> Unfortunately, commit 4b20b07ce72f has made the governor always call
>>> tick_nohz_get_sleep_length() when about to return state 0 early, even
>>> if that state is a "polling" one, which is inconsistent and defeats
>>> the purpose of commit 6da8f9ba5a87 in that case.
>>>
>>> Address this by adding a CPUIDLE_FLAG_POLLING check to the path where
>>> state 0 is returned early to prevent tick_nohz_get_sleep_length() from
>>> being called if it is a "polling" state.
>>>
>>> Fixes: 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts")
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
>>> ---
>>> drivers/cpuidle/governors/teo.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> --- a/drivers/cpuidle/governors/teo.c
>>> +++ b/drivers/cpuidle/governors/teo.c
>>> @@ -422,7 +422,8 @@
>>> first_suitable_idx = i;
>>> }
>>> }
>>> - if (!idx && prev_intercept_idx) {
>>> + if (!idx && prev_intercept_idx &&
>>> + !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)) {
>>> /*
>>> * We have to query the sleep length here otherwise we don't
>>> * know after wakeup if our guess was correct.
>>>
>>>
>>>
>>
>> But then you do run into the issue of intercepts not being detected if
>> state0 is the right choice, don't you?
>
> That's true, but then on systems with a "polling" state 0 you still
> have this problem if the state returned early is not state 0. Say C1
> on x86.>
> The point here is that the behavior needs to be consistent, one way or another.
Yes, gotcha. Why not be consistent 'in the other way' then?
>
> The exact point of commit 6da8f9ba5a87 was to avoid calling
> tick_nohz_get_sleep_length() in some cases when the state to be
> returned is shallow enough and obviously that includes a "polling"
> state 0, possibly at the cost of being somewhat inaccurate in
> prediction.
Somewhat inaccurate meaning not making any prediction?
cpu_data->sleep_length_ns = KTIME_MAX;
How much is the harm for calling tick_nohz_get_sleep_length() when
polling anyway?
I know tick_nohz_get_sleep_length() is the majority of the usual
cpuidle entry path, but for many scenarios where state0 is appropriate
that should be pretty fast, no?
>
> Then you're seeing this intercept accumulation for state 0 when there
> are only 2 states in the table (or all of the other states are much
> higher target residency than state 0).
>
> Commit 4b20b07ce72f effectively caused tick_nohz_get_sleep_length() to
> be called every time on systems without a "polling" state 0, which was
> fair enough, but it also affected the other systems, which wasn't.
>
>> This would then enable intercept-detection only for <50% of the time,
>> another option is to not allow intercepts selecting a polling state, but
>> there were recent complaints about this exact behavior from Aboorva (+TO).
>> They don't have a low-latency non-polling state.
>>
>> https://lore.kernel.org/lkml/20240809073120.250974-1-aboorvad@xxxxxxxxxxxxx/
>
> If they don't have a "polling" state 0, they won't be affected by this
> patch and after commit 4b20b07ce72f, they'll always call
> tick_nohz_get_sleep_length(), so the current governor behavior is
> generally unsuitable for them.
They do though.
commit 5ddcc03a07ae ("powerpc/cpuidle: Set CPUIDLE_FLAG_POLLING for snooze state")
So they have a polling 'snooze' and a relatively high latency (hundreds usecs)
non-polling state and no deeper state.
So if they don't query sleep length on snooze on a (1us)-interrupt-wakeup heavy
workload they will get 50% state0 and 50% state1 (because intercepts recovered
due to not querying sleep length).
>
> I have an idea how to change it to be more accurate in prediction, but
> we'll see how it goes. Stay tuned.
I will.
Regards,
Christian