Re: [PATCH v1 1/4] cpuidle: teo: Add polling flag check to early return path
From: Rafael J. Wysocki
Date: Fri Jan 10 2025 - 10:18:06 EST
On Fri, Jan 10, 2025 at 3:52 PM Christian Loehle
<christian.loehle@xxxxxxx> wrote:
>
> On 1/10/25 13:34, Rafael J. Wysocki wrote:
> > On Fri, Jan 10, 2025 at 2:16 PM Christian Loehle
> > <christian.loehle@xxxxxxx> wrote:
> >>
> >> On 1/10/25 12:53, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> >>>
> >>> After commit 6da8f9ba5a87 ("cpuidle: teo: Skip tick_nohz_get_sleep_length()
> >>> call in some cases") the teo governor behaves a bit differently on
> >>> systems where idle state 0 is a "polling" state (that is, it is not
> >>> really an idle state, but a loop continuously executed by the CPU).
> >>> Namely, on such systems it skips the tick_nohz_get_sleep_length() call
> >>> if the target residency of the current candidate idle state is small
> >>> enough.
> >>>
> >>> However, if state 0 itself was to be returned, it would be returned
> >>> right away without calling tick_nohz_get_sleep_length() even on systems
> >>> where it was not a "polling" state until commit 4b20b07ce72f ("cpuidle:
> >>> teo: Don't count non-existent intercepts") that attempted to fix this
> >>> problem.
> >>>
> >>> Unfortunately, commit 4b20b07ce72f has made the governor always call
> >>> tick_nohz_get_sleep_length() when about to return state 0 early, even
> >>> if that state is a "polling" one, which is inconsistent and defeats
> >>> the purpose of commit 6da8f9ba5a87 in that case.
> >>>
> >>> Address this by adding a CPUIDLE_FLAG_POLLING check to the path where
> >>> state 0 is returned early to prevent tick_nohz_get_sleep_length() from
> >>> being called if it is a "polling" state.
> >>>
> >>> Fixes: 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts")
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> >>> ---
> >>> drivers/cpuidle/governors/teo.c | 3 ++-
> >>> 1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> --- a/drivers/cpuidle/governors/teo.c
> >>> +++ b/drivers/cpuidle/governors/teo.c
> >>> @@ -422,7 +422,8 @@
> >>> first_suitable_idx = i;
> >>> }
> >>> }
> >>> - if (!idx && prev_intercept_idx) {
> >>> + if (!idx && prev_intercept_idx &&
> >>> + !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)) {
> >>> /*
> >>> * We have to query the sleep length here otherwise we don't
> >>> * know after wakeup if our guess was correct.
> >>>
> >>>
> >>>
> >>
> >> But then you do run into the issue of intercepts not being detected if
> >> state0 is the right choice, don't you?
> >
> > That's true, but then on systems with a "polling" state 0 you still
> > have this problem if the state returned early is not state 0. Say C1
> > on x86.>
> > The point here is that the behavior needs to be consistent, one way or another.
>
> Yes, gotcha. Why not be consistent 'in the other way' then?
>
> >
> > The exact point of commit 6da8f9ba5a87 was to avoid calling
> > tick_nohz_get_sleep_length() in some cases when the state to be
> > returned is shallow enough and obviously that includes a "polling"
> > state 0, possibly at the cost of being somewhat inaccurate in
> > prediction.
>
> Somewhat inaccurate meaning not making any prediction?
> cpu_data->sleep_length_ns = KTIME_MAX;
Yes, and the wakeup is going to be counted as an intercept which of
course may be inaccurate, but how inaccurate it really is depends on
the workload. If the workload doesn't contain short timers, it
actually is as good as it gets.
> How much is the harm for calling tick_nohz_get_sleep_length() when
> polling anyway?
On x86, a lot, especially on systems with many cores.
This was the whole reason for doing commit 6da8f9ba5a87 in the first place.
> I know tick_nohz_get_sleep_length() is the majority of the usual
> cpuidle entry path, but for many scenarios where state0 is appropriate
> that should be pretty fast, no?
Not necessarily.
For single-digit exit latency idle states the
tick_nohz_get_sleep_length() evaluation time may exceed the state exit
latency and is comparable to the target residency, so calling it
pretty much doesn't make sense.
> >
> > Then you're seeing this intercept accumulation for state 0 when there
> > are only 2 states in the table (or all of the other states are much
> > higher target residency than state 0).
> >
> > Commit 4b20b07ce72f effectively caused tick_nohz_get_sleep_length() to
> > be called every time on systems without a "polling" state 0, which was
> > fair enough, but it also affected the other systems, which wasn't.
> >
> >> This would then enable intercept-detection only for <50% of the time,
> >> another option is to not allow intercepts selecting a polling state, but
> >> there were recent complaints about this exact behavior from Aboorva (+TO).
> >> They don't have a low-latency non-polling state.
> >>
> >> https://lore.kernel.org/lkml/20240809073120.250974-1-aboorvad@xxxxxxxxxxxxx/
> >
> > If they don't have a "polling" state 0, they won't be affected by this
> > patch and after commit 4b20b07ce72f, they'll always call
> > tick_nohz_get_sleep_length(), so the current governor behavior is
> > generally unsuitable for them.
>
> They do though.
> commit 5ddcc03a07ae ("powerpc/cpuidle: Set CPUIDLE_FLAG_POLLING for snooze state")
> So they have a polling 'snooze' and a relatively high latency (hundreds usecs)
> non-polling state and no deeper state.
>
> So if they don't query sleep length on snooze on a (1us)-interrupt-wakeup heavy
> workload they will get 50% state0 and 50% state1 (because intercepts recovered
> due to not querying sleep length).
Obviously.
OK, I guess I'll just do the other cleanups without this change to start with.