[tip: x86/urgent] x86/nmi: Upgrade NMI backtrace stall checks & messages

From: tip-bot2 for Paul E. McKenney
Date: Tue Mar 26 2024 - 05:36:13 EST


The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 3186b61812c007179f1852d8e63a0f0f7dd7c783
Gitweb: https://git.kernel.org/tip/3186b61812c007179f1852d8e63a0f0f7dd7c783
Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
AuthorDate: Thu, 07 Mar 2024 17:32:15 -08:00
Committer: Ingo Molnar <mingo@xxxxxxxxxx>
CommitterDate: Tue, 26 Mar 2024 10:07:59 +01:00

x86/nmi: Upgrade NMI backtrace stall checks & messages

The commit to improve NMI stall debuggability:

344da544f177 ("x86/nmi: Print reasons why backtrace NMIs are ignored")

.. has shown value, but widespread use has also identified a few
opportunities for improvement.

The systems have (as usual) shown far more creativity than that commit's
author, demonstrating yet again that failing CPUs can do whatever they want.

In addition, the current message format is less friendly than one might
like to those attempting to use these messages to identify failing CPUs.

Therefore, separately flag CPUs that, during the full time that the
stack-backtrace request was waiting, were always in an NMI handler,
were never in an NMI handler, or exited one NMI handler.

Also, split the message identifying the CPU and the time since that CPU's
last NMI-related activity so that a single line identifies the CPU without
any other variable information, greatly reducing the processing overhead
required to identify repeat-offender CPUs.

Co-developed-by: Breno Leitao <leitao@xxxxxxxxxx>
Signed-off-by: Breno Leitao <leitao@xxxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Link: https://lore.kernel.org/r/ab4d70c8-c874-42dc-b206-643018922393@paulmck-laptop
---
arch/x86/kernel/nmi.c | 24 ++++++++++++++----------
1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 9a5b372..ed163c8 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -580,7 +580,7 @@ EXPORT_SYMBOL_GPL(asm_exc_nmi_kvm_vmx);

static char *nmi_check_stall_msg[] = {
/* */
-/* +--------- nsp->idt_seq_snap & 0x1: CPU is in NMI handler. */
+/* +--------- nmi_seq & 0x1: CPU is currently in NMI handler. */
/* | +------ cpu_is_offline(cpu) */
/* | | +--- nsp->idt_calls_snap != atomic_long_read(&nsp->idt_calls): */
/* | | | NMI handler has been invoked. */
@@ -628,22 +628,26 @@ void nmi_backtrace_stall_check(const struct cpumask *btp)
nmi_seq = READ_ONCE(nsp->idt_nmi_seq);
if (nsp->idt_nmi_seq_snap + 1 == nmi_seq && (nmi_seq & 0x1)) {
msgp = "CPU entered NMI handler function, but has not exited";
- } else if ((nsp->idt_nmi_seq_snap & 0x1) != (nmi_seq & 0x1)) {
- msgp = "CPU is handling NMIs";
- } else {
- idx = ((nsp->idt_seq_snap & 0x1) << 2) |
+ } else if (nsp->idt_nmi_seq_snap == nmi_seq ||
+ nsp->idt_nmi_seq_snap + 1 == nmi_seq) {
+ idx = ((nmi_seq & 0x1) << 2) |
(cpu_is_offline(cpu) << 1) |
(nsp->idt_calls_snap != atomic_long_read(&nsp->idt_calls));
msgp = nmi_check_stall_msg[idx];
if (nsp->idt_ignored_snap != READ_ONCE(nsp->idt_ignored) && (idx & 0x1))
modp = ", but OK because ignore_nmis was set";
- if (nmi_seq & 0x1)
- msghp = " (CPU currently in NMI handler function)";
- else if (nsp->idt_nmi_seq_snap + 1 == nmi_seq)
+ if (nsp->idt_nmi_seq_snap + 1 == nmi_seq)
msghp = " (CPU exited one NMI handler function)";
+ else if (nmi_seq & 0x1)
+ msghp = " (CPU currently in NMI handler function)";
+ else
+ msghp = " (CPU was never in an NMI handler function)";
+ } else {
+ msgp = "CPU is handling NMIs";
}
- pr_alert("%s: CPU %d: %s%s%s, last activity: %lu jiffies ago.\n",
- __func__, cpu, msgp, modp, msghp, j - READ_ONCE(nsp->recv_jiffies));
+ pr_alert("%s: CPU %d: %s%s%s\n", __func__, cpu, msgp, modp, msghp);
+ pr_alert("%s: last activity: %lu jiffies ago.\n",
+ __func__, j - READ_ONCE(nsp->recv_jiffies));
}
}