Re: [PATCH] lib/nmi_backtrace: print out the CPUs which fail to respond to NMI

In reply to: Feng Tang: "[PATCH] lib/nmi_backtrace: print out the CPUs which fail to respond to NMI"

From: Andrew Morton

Date: Thu May 21 2026 - 18:38:52 EST

On Thu, 21 May 2026 11:03:36 +0800 Feng Tang <feng.tang@xxxxxxxxxxxxxxxxx> wrote:

> When debugging RCU stall cases, usually all CPUs will respond to the
> NMI and print out the backtrace. But in some nasty or hardware related
> cases, some CPUs may fail to respond in 10 seconds, and very likely
> this is a sign of severe issues.
>
> Paul E. McKenney has implemented the NMI backtrace stall check for x86,
> and for other architectures, it should be also helpful to at least
> print out those CPUs which failed to respond to the NMI, so that users
> can get an early heads-up for possible CPU hard stall.

That must be one messed up machine. Is this something you've
encountered in real life?

> --- a/lib/nmi_backtrace.c
> +++ b/lib/nmi_backtrace.c
> @@ -75,7 +75,13 @@ void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
>  		mdelay(1);
>  		touch_softlockup_watchdog();
>  	}
> -	nmi_backtrace_stall_check(to_cpumask(backtrace_mask));
> +
> +	if (!cpumask_empty(to_cpumask(backtrace_mask))) {
> +		pr_warn("After 10 seconds, these CPUS still haven't responded to the NMI: %*pbl\n",
> +			cpumask_pr_args(to_cpumask(backtrace_mask)));
> +
> +		nmi_backtrace_stall_check(to_cpumask(backtrace_mask));
> +	}

It's a nitpick, but

: 	/* Wait for up to 10 seconds for all CPUs to do the backtrace */
: 	for (i = 0; i < 10 * 1000; i++) {
: 		if (cpumask_empty(to_cpumask(backtrace_mask)))
: 			break;
: 		mdelay(1);
: 		touch_softlockup_watchdog();
: 	}
:
: 	if (!cpumask_empty(to_cpumask(backtrace_mask))) {
: 		pr_warn("After 10 seconds, these CPUS still haven't responded to the NMI: %*pbl\n",

Here we're hard-coding "10" in two places and in a comment. It would
be nicer to do

#define FOO_TIMEOUT 10

then use that throughout.

(bonus points for figuring out how to paste that "10" into the
pr_warn() control string rather than using %d!)
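
One possible way to earn those bonus points, sketched as a userspace
program so it can be compiled outside the kernel. The names
BACKTRACE_TIMEOUT_SECS, BACKTRACE_TIMEOUT_FMT and
format_timeout_warning() are made up for illustration, and STRINGIFY()
is a local stand-in for the kernel's __stringify() from
<linux/stringify.h>. It relies on adjacent string literals being
concatenated at compile time, so no %d is needed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Two-level stringification: the inner macro applies '#', the outer one
 * lets the argument be macro-expanded first. This mirrors what the
 * kernel's __stringify() does; the names here are local stand-ins.
 */
#define STRINGIFY_1(x)	#x
#define STRINGIFY(x)	STRINGIFY_1(x)

/* Hypothetical name; the review only suggests "FOO_TIMEOUT". */
#define BACKTRACE_TIMEOUT_SECS	10

/*
 * Adjacent string literals are merged by the compiler, so the "10"
 * ends up inside the control string itself. %s stands in for the
 * kernel-only %*pbl cpumask format.
 */
#define BACKTRACE_TIMEOUT_FMT \
	"After " STRINGIFY(BACKTRACE_TIMEOUT_SECS) \
	" seconds, these CPUS still haven't responded to the NMI: %s\n"

/* Format the warning into buf; returns what snprintf() returns. */
static int format_timeout_warning(char *buf, size_t len, const char *cpus)
{
	assert(buf && cpus);
	return snprintf(buf, len, BACKTRACE_TIMEOUT_FMT, cpus);
}
```

The wait loop would then presumably run to BACKTRACE_TIMEOUT_SECS * 1000.
One caveat with this approach: the define has to stay a bare literal.
Writing it as (10) or as an expression would paste the parentheses or
the expression text into the message.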