Re: [PATCH v8 4/4] x86/smp native_play_dead: Prefer cpuidle_play_dead() over mwait_play_dead()
From: Rafael J. Wysocki
Date: Tue Dec 10 2024 - 15:51:11 EST
First off, I'd change the subject to something like "x86/smp:
Eliminate mwait_play_dead_cpuid_hint()" because that's what the patch
is doing.
On Wed, Dec 4, 2024 at 3:08 PM Patryk Wlazlyn
<patryk.wlazlyn@xxxxxxxxxxxxxxx> wrote:
>
> The current algorithm* for looking up the mwait hint for the deepest
> cstate, in mwait_play_dead_cpuid_hint() code works by inspecting CPUID
> leaf 0x5 and calculates the mwait hint based on the number of reported
> substates.
I would just say
"Currently, mwait_play_dead_cpuid_hint() looks up the MWAIT hint of
the deepest idle state by inspecting CPUID leaf 0x5 with the
assumption that, if the number of sub-states for a given major C-state
is nonzero, those sub-states are always represented by consecutive
numbers starting from 0."
> This approach depends on the hints associated with them to
> be continuous in the range [0, NUM_SUBSTATES-1]. This continuity is not
> documented and is not met on the recent Intel platforms.
And then
"This assumption is not based on the documented platform behavior and
in fact it is not met on recent Intel platforms."
>
> * The current algorithm is implemented in the for loop inspecting edx
> in mwait_play_dead_cpuid_hint().
The above sentence does not add any value IMV.
> For example, Intel's Sierra Forest report two cstates with two substates
> each in cpuid leaf 0x5:
It's "Intel Sierra Forest" and if you said C-states above, please be
consistent and say C-states here too.
>
> Name* target cstate target subcstate (mwait hint)
> ===========================================================
> C1 0x00 0x00
> C1E 0x00 0x01
>
> -- 0x10 ----
>
> C6S 0x20 0x22
> C6P 0x20 0x23
>
> -- 0x30 ----
>
> /* No more (sub)states all the way down to the end. */
> ===========================================================
>
> * Names of the cstates are not included in the CPUID leaf 0x5, they are
> taken from the product specific documentation.
>
> Notice that hints 0x20 and 0x21 are skipped entirely for the target
> cstate 0x20 (C6), being a cause of the problem for the current cpuid
> leaf 0x5 algorithm.
And here
"Notice that hints 0x20 and 0x21 are not defined for C-state 0x20
(C6), so the existing MWAIT hint lookup in
mwait_play_dead_cpuid_hint() based on the CPUID leaf 0x5 contents does
not work in this case."
> Remove the old implementation of play_dead MWAIT hint calculation based
> on the CPUID leaf 0x5 in mwait_play_dead_cpuid_hint() and delegate
> calling of the mwait_play_dead() to the idle driver.
Well, the above is not exactly what's going on.
I'd say something like
"Instead of using MWAIT hint lookup that is not guaranteed to work,
make native_play_dead() rely on the idle driver for the given platform
to put CPUs going offline into appropriate idle state and, if that
fails, fall back to hlt_play_dead().
Accordingly, drop mwait_play_dead_cpuid_hint() altogether and make
native_play_dead() call cpuidle_play_dead() instead of it
unconditionally with the assumption that it will not return if it is
successful. Still, in case cpuidle_play_dead() fails, call
hlt_play_dead() at the end."
> Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@xxxxxxxxxxxxxxx>
> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@xxxxxxx>
> ---
> arch/x86/kernel/smpboot.c | 56 +++++----------------------------------
> 1 file changed, 7 insertions(+), 49 deletions(-)
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 8a3545c2cae9..82801137486d 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1272,6 +1272,10 @@ void play_dead_common(void)
> local_irq_disable();
> }
>
> +/*
> + * We need to flush the caches before going to sleep, lest we have
> + * dirty data in our caches when we come back up.
> + */
> void __noreturn mwait_play_dead(unsigned int eax_hint)
> {
> struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
> @@ -1317,52 +1321,6 @@ void __noreturn mwait_play_dead(unsigned int eax_hint)
> }
> }
>
> -/*
> - * We need to flush the caches before going to sleep, lest we have
> - * dirty data in our caches when we come back up.
> - */
> -static inline void mwait_play_dead_cpuid_hint(void)
> -{
> - unsigned int eax, ebx, ecx, edx;
> - unsigned int highest_cstate = 0;
> - unsigned int highest_subcstate = 0;
> - int i;
> -
> - if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
> - boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
> - return;
> - if (!this_cpu_has(X86_FEATURE_MWAIT))
> - return;
> - if (!this_cpu_has(X86_FEATURE_CLFLUSH))
> - return;
> - if (__this_cpu_read(cpu_info.cpuid_level) < CPUID_MWAIT_LEAF)
> - return;
> -
> - eax = CPUID_MWAIT_LEAF;
> - ecx = 0;
> - native_cpuid(&eax, &ebx, &ecx, &edx);
> -
> - /*
> - * eax will be 0 if EDX enumeration is not valid.
> - * Initialized below to cstate, sub_cstate value when EDX is valid.
> - */
> - if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED)) {
> - eax = 0;
> - } else {
> - edx >>= MWAIT_SUBSTATE_SIZE;
> - for (i = 0; i < 7 && edx; i++, edx >>= MWAIT_SUBSTATE_SIZE) {
> - if (edx & MWAIT_SUBSTATE_MASK) {
> - highest_cstate = i;
> - highest_subcstate = edx & MWAIT_SUBSTATE_MASK;
> - }
> - }
> - eax = (highest_cstate << MWAIT_SUBSTATE_SIZE) |
> - (highest_subcstate - 1);
> - }
> -
> - mwait_play_dead(eax);
> -}
> -
> /*
> * Kick all "offline" CPUs out of mwait on kexec(). See comment in
> * mwait_play_dead().
> @@ -1413,9 +1371,9 @@ void native_play_dead(void)
> play_dead_common();
> tboot_shutdown(TB_SHUTDOWN_WFS);
>
> - mwait_play_dead_cpuid_hint();
> - if (cpuidle_play_dead())
> - hlt_play_dead();
> + /* Below returns only on error. */
> + cpuidle_play_dead();
> + hlt_play_dead();
> }
>
> #else /* ... !CONFIG_HOTPLUG_CPU */
> --