Re: [PATCH] ACPI: processor idle: Practically limit "Dummy wait" workaround to old Intel systems
From: K Prateek Nayak
Date: Fri Sep 23 2022 - 16:42:34 EST
Hello Dave,
On 9/23/2022 12:17 AM, Dave Hansen wrote:
> Old, circa 2002 chipsets have a bug: they don't go idle when they are
> supposed to. So, a workaround was added to slow the CPU down and
> ensure that the CPU waits a bit for the chipset to actually go idle.
> This workaround is ancient and has been in place in some form since
> the original kernel ACPI implementation.
>
> But, this workaround is very painful on modern systems. The "inl()"
> can take thousands of cycles (see Link: for some more detailed
> numbers and some fun kernel archaeology).
>
> First and foremost, modern systems should not be using this code.
> Typical Intel systems have not used it in over a decade because it is
> horribly inferior to MWAIT-based idle.
>
> Despite this, people do seem to be tripping over this workaround on
> AMD system today.
>
> Limit the "dummy wait" workaround to Intel systems. Keep Modern AMD
> systems from tripping over the workaround. Remotely modern Intel
> systems use intel_idle instead of this code and will, in practice,
> remain unaffected by the dummy wait.
I ran tbench 30 times with 128 clients on a dual-socket Zen3 system
(2 x 64C/128T) and no longer see the massive regression we used to hit
because of the dummy wait issue:
Kernel          :  baseline    baseline + C2 disabled    baseline + this patch
Min (MB/s)      :   2215.06    33072.10 (+1393.05%)      30519.60 (+1277.82%)
Max (MB/s)      :  32938.80    34399.10                  32699.30
Median (MB/s)   :  32191.80    33476.60                  31418.90
AMean (MB/s)    :  22448.55    33649.27 (+49.89%)        31545.93 (+40.52%)
AMean Stddev    :  17526.70      680.14                   1095.39
AMean CoefVar   :    78.07%       2.02%                     3.47%
The range is well within the variation we've normally seen with tbench
on the test platform.
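
For anyone who wants to recheck the derived columns, here is a minimal
sketch of how I would expect them to be computed, assuming CoefVar is
stddev / mean and the percentages are gains relative to the baseline
column. The helper names are made up for illustration and are not part of
tbench or any reporting tool.

```c
/* Hypothetical helpers; the names are illustrative only. */

/* Coefficient of variation, in percent: stddev relative to the mean. */
double coef_var_pct(double stddev, double mean)
{
	return 100.0 * stddev / mean;
}

/* Relative gain of "value" over "baseline", in percent. */
double gain_pct(double baseline, double value)
{
	return 100.0 * (value - baseline) / baseline;
}
```

For example, coef_var_pct(17526.70, 22448.55) comes out at roughly 78.07,
and gain_pct(2215.06, 33072.10) at roughly 1393.05, matching the baseline
CoefVar and the Min improvement with C2 disabled.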
>
> Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Cc: Len Brown <lenb@xxxxxxxxxx>
> Cc: Mario Limonciello <Mario.Limonciello@xxxxxxx>
> Cc: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Cc: Borislav Petkov <bp@xxxxxxxxx>
Can you please add a cc to stable?
Cc: stable@xxxxxxxxxxxxxxx
> Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> Reported-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> Link: https://lore.kernel.org/all/20220921063638.2489-1-kprateek.nayak@xxxxxxx/
> ---
> drivers/acpi/processor_idle.c | 23 ++++++++++++++++++++---
> 1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 16a1663d02d4..9f40917c49ef 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -531,10 +531,27 @@ static void wait_for_freeze(void)
> /* No delay is needed if we are in guest */
> if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
> return;
> + /*
> + * Modern (>=Nehalem) Intel systems use ACPI via intel_idle,
> + * not this code. Assume that any Intel systems using this
> + * are ancient and may need the dummy wait. This also assumes
> + * that the motivating chipset issue was Intel-only.
> + */
> + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
Based on Andreas's comment, this problem is not limited to Intel chipsets
and affects at least the AMD Athlon on a VIA chipset (circa 2006):
(https://lore.kernel.org/lkml/Yyy6l94G0O2B7Yh1@xxxxxxxxxxxxxxxxxxxxxx/)
To be on the safer side, the exception could be made only for AMD Fam 17h+
and, as Peter pointed out, Hygon, where we know the dummy wait is
unnecessary. Extending the condition you proposed, we can have:
	if (boot_cpu_data.x86_vendor == X86_VENDOR_HYGON ||
	    ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) &&
	     (boot_cpu_data.x86 >= 0x17)))
		return;
It is not pretty by any means, which is why we can instead use an
X86_BUG_STPCLK flag to limit the dummy op to only the affected processors.
This way, the x86 vendor and family checks can be kept out of the ACPI
code. A v2 has been sent out tackling the problem this way:
https://lore.kernel.org/lkml/20220923153801.9167-1-kprateek.nayak@xxxxxxx/
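
To illustrate the idea only (this is not the v2 patch), here is a
user-space sketch of the split: the vendor and family policy runs once at
CPU setup and sets a bug flag, and the idle path just tests that flag. The
struct, the enum and both function names are hypothetical stand-ins; only
the Hygon / AMD Fam 17h+ exception is taken from the discussion above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's X86_VENDOR_* constants. */
enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_HYGON, VENDOR_OTHER };

/* Hypothetical, trimmed-down model of struct cpuinfo_x86. */
struct cpu_model {
	enum cpu_vendor vendor;
	unsigned int family;	/* models boot_cpu_data.x86 */
	bool bug_stpclk;	/* models a per-CPU X86_BUG_STPCLK flag */
};

/*
 * Setup-time policy (assumed for this sketch): flag every CPU except
 * those known not to need the dummy wait, i.e. Hygon and AMD Fam 17h+.
 */
void cpu_detect_stpclk_bug(struct cpu_model *c)
{
	bool known_good = c->vendor == VENDOR_HYGON ||
			  (c->vendor == VENDOR_AMD && c->family >= 0x17);

	c->bug_stpclk = !known_good;
}

/* Idle-path check: no vendor or family logic is needed here. */
bool needs_dummy_wait(const struct cpu_model *c)
{
	return c->bug_stpclk;
}
```

With this shape, the check in wait_for_freeze() would reduce to a single
test of the bug flag, and any later change to the list of affected CPUs
stays in the CPU setup code.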
> + return;
> #endif
> - /* Dummy wait op - must do something useless after P_LVL2 read
> - because chipsets cannot guarantee that STPCLK# signal
> - gets asserted in time to freeze execution properly. */
> + /*
> + * Dummy wait op - must do something useless after P_LVL2 read
> + * because chipsets cannot guarantee that STPCLK# signal gets
> + * asserted in time to freeze execution properly
> + *
> + * This workaround has been in place since the original ACPI
> + * implementation was merged, circa 2002.
> + *
> + * If a profile is pointing to this instruction, please first
> + * consider moving your system to a more modern idle
> + * mechanism.
> + */
> inl(acpi_gbl_FADT.xpm_timer_block.address);
> }
>
The patch, as it is, solves the problem we've seen on the newer AMD
platforms with large core density that use IOPORT based C-states.
Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
--
Thanks and Regards,
Prateek