Re: [PATCH] lib/nmi_backtrace: print out the CPUs which fail to respond to NMI

From: Feng Tang

Date: Thu May 21 2026 - 21:46:37 EST

Hi Andrew,

Thanks for the review!

On Thu, May 21, 2026 at 03:37:16PM -0700, Andrew Morton wrote:
> On Thu, 21 May 2026 11:03:36 +0800 Feng Tang <feng.tang@xxxxxxxxxxxxxxxxx> wrote:
>
> > When debugging RCU stall cases, usually all CPUs will respond to the
> > NMI and print out the backtrace. But in some nasty or hardware-related
> > cases, some CPUs may fail to respond within 10 seconds, and very likely
> > this is a sign of severe issues.
> >
> > Paul E. McKenney has implemented the NMI backtrace stall check for x86,
> > and for other architectures, it should also be helpful to at least
> > print out those CPUs which failed to respond to the NMI, so that users
> > can get an early heads-up for a possible CPU hard stall.
>
> That must be one messed up machine. Is this something you've
> encountered in real life?

Yes. A big part of my work time is spent on panic/lockup/rcustall/hung
bugs :). We did hit real cases where such a warning would have given us
a good hint to focus on the non-responding CPU. In one case, the kernel
requested a backtrace from 31 CPUs and only 30 actually produced one;
the remaining CPU, which went unnoticed, was the root cause.

> > --- a/lib/nmi_backtrace.c
> > +++ b/lib/nmi_backtrace.c
> > @@ -75,7 +75,13 @@ void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
> > mdelay(1);
> > touch_softlockup_watchdog();
> > }
> > - nmi_backtrace_stall_check(to_cpumask(backtrace_mask));
> > +
> > + if (!cpumask_empty(to_cpumask(backtrace_mask))) {
> > + pr_warn("After 10 seconds, these CPUS still haven't responded to the NMI: %*pbl\n",
> > + cpumask_pr_args(to_cpumask(backtrace_mask)));
> > +
> > + nmi_backtrace_stall_check(to_cpumask(backtrace_mask));
> > + }
>
> It's a nitpick, but
>
> : /* Wait for up to 10 seconds for all CPUs to do the backtrace */
> : for (i = 0; i < 10 * 1000; i++) {
> : if (cpumask_empty(to_cpumask(backtrace_mask)))
> : break;
> : mdelay(1);
> : touch_softlockup_watchdog();
> : }
> :
> : if (!cpumask_empty(to_cpumask(backtrace_mask))) {
> : pr_warn("After 10 seconds, these CPUS still haven't responded to the NMI: %*pbl\n",
>
> Here we're hard-coding "10" in two places and in a comment. It would
> be nicer to do
>
> #define FOO_TIMEOUT 10
>
> then use that throughout.
>
> (bonus points for figuring out how to paste that "10" into the
> pr_warn() control string rather than using %d!)

How about this follow-on patch?
---
diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c
index a113d3d669be..2810b8f478a4 100644
--- a/lib/nmi_backtrace.c
+++ b/lib/nmi_backtrace.c
@@ -27,6 +27,8 @@ static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
 /* "in progress" flag of arch_trigger_cpumask_backtrace */
 static unsigned long backtrace_flag;

+#define NMI_BT_TIMEOUT_SEC 10
+
 /*
  * When raise() is called it will be passed a pointer to the
  * backtrace_mask. Architectures that call nmi_cpu_backtrace()
@@ -68,8 +70,8 @@ void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
 		raise(to_cpumask(backtrace_mask));
 	}

-	/* Wait for up to 10 seconds for all CPUs to do the backtrace */
-	for (i = 0; i < 10 * 1000; i++) {
+	/* Wait for up to NMI_BT_TIMEOUT_SEC seconds for all CPUs to do the backtrace */
+	for (i = 0; i < NMI_BT_TIMEOUT_SEC * 1000; i++) {
 		if (cpumask_empty(to_cpumask(backtrace_mask)))
 			break;
 		mdelay(1);
@@ -77,8 +79,8 @@ void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
 	}

 	if (!cpumask_empty(to_cpumask(backtrace_mask))) {
-		pr_warn("After 10 seconds, these CPUS still haven't responded to the NMI: %*pbl\n",
-			cpumask_pr_args(to_cpumask(backtrace_mask)));
+		pr_warn("After %d seconds, these CPUS still haven't responded to the NMI: %*pbl\n",
+			NMI_BT_TIMEOUT_SEC, cpumask_pr_args(to_cpumask(backtrace_mask)));

 		nmi_backtrace_stall_check(to_cpumask(backtrace_mask));
 	}
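
On the bonus point: one possible way to paste the number into the control
string itself, instead of passing it through %d, is two-level
stringification, which the kernel provides as __stringify() in
<linux/stringify.h>. The following is only a userspace sketch of that idea
and has not been tested against the kernel tree. STRINGIFY(),
NMI_BT_TIMEOUT_MSG and format_timeout_msg() are hypothetical names used for
illustration, and %s stands in for the kernel-only %*pbl specifier.

```c
#include <stdio.h>

/*
 * Two-level stringification: the outer macro expands its argument
 * first, so the inner '#' sees 10 rather than NMI_BT_TIMEOUT_SEC.
 * Hypothetical userspace stand-in for the kernel's __stringify().
 */
#define STRINGIFY_1(x)	#x
#define STRINGIFY(x)	STRINGIFY_1(x)

#define NMI_BT_TIMEOUT_SEC 10

/*
 * Adjacent string literals are concatenated by the compiler, so the
 * timeout becomes part of the format string at build time.
 * %s replaces the kernel's %*pbl, which plain printf() does not know.
 */
#define NMI_BT_TIMEOUT_MSG \
	"After " STRINGIFY(NMI_BT_TIMEOUT_SEC) \
	" seconds, these CPUS still haven't responded to the NMI: %s\n"

/* Hypothetical helper: formats the warning into buf for a given CPU list. */
int format_timeout_msg(char *buf, size_t len, const char *cpulist)
{
	return snprintf(buf, len, NMI_BT_TIMEOUT_MSG, cpulist);
}
```

One caveat with this approach: the macro body is pasted verbatim, so it only
reads well while NMI_BT_TIMEOUT_SEC is a plain literal. If it were defined
as (10) or 5 * 2, that exact text would show up in the message, which is one
argument for staying with %d.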