Re: [PATCH v2 3/3] arm64: escalate smp_send_stop() to an SDEI NMI as a last resort

From: Kiryl Shutsemau

Date: Thu Jun 11 2026 - 13:47:54 EST

On Wed, Jun 10, 2026 at 03:50:32PM -0700, Doug Anderson wrote:
> Hi,
>
> On Tue, Jun 9, 2026 at 6:58 AM Kiryl Shutsemau <kirill@xxxxxxxxxxxxx> wrote:
> >
> > @@ -910,6 +911,35 @@ static void __noreturn ipi_cpu_crash_stop(unsigned int cpu, struct pt_regs *regs
> > #endif
> > }
> >
> > +#ifdef CONFIG_ARM_SDEI_NMI
> > +/*
> > + * Stop entry for the SDEI cross-CPU NMI service: its event-0 handler
> > + * lands here when this CPU was asked to stop. The bookkeeping mirrors
> > + * the IPI_CPU_STOP{,_NMI} handling; the park happens inside the SDEI
> > + * event, which is never completed -- completing it would have firmware
> > + * resume the interrupted (typically wedged) context. No PSCI CPU_OFF
> > + * either: powering off a PE that EL3 still considers mid-event invites
> > + * firmware trouble.
> > + */
> > +void __noreturn arm64_nmi_cpu_stop(struct pt_regs *regs)
> > +{
> > + unsigned int cpu = smp_processor_id();
> > +
> > + local_daif_mask();
> > +
> > + if (IS_ENABLED(CONFIG_KEXEC_CORE) && crash_stop)
> > + crash_save_cpu(regs, cpu);
> > +
> > + /* the ack the stop requester polls for */
> > + set_cpu_online(cpu, false);
> > +
> > + sdei_mask_local_cpu();
> > +
> > + cpu_park_loop();
> > +}
> > +NOKPROBE_SYMBOL(arm64_nmi_cpu_stop);
> > +#endif
>
> Can we combine everything into one function so we don't have to keep
> all the different stop functionality in sync? Like this (untested):
>

Okay. Look good to me. See the patch below.

void __noreturn arm64_nmi_cpu_stop(struct pt_regs *regs, bool die_on_crash)
The stop IPI handlers call it with die_on_crash=true, the SDEI handler
and panic_smp_self_stop() with false. Pretty much your sketch, with the
crash = IS_ENABLED(CONFIG_KEXEC_CORE) && crash_stop discriminator inside.

> FWIW, I'm not totally sure I followed the logic for why "die_on_crash"
> needs to be "false" for the SDEI case,

It's not about kexec mechanics, it's about the SDEI dispatch state.

The SDEI stop handler parks inside an SDEI event that it deliberately
never completes — completing it makes firmware resume the wedged
context, which is the opposite of what we want. PSCI CPU_OFF from inside
that not-yet-completed event silently wedges EL3 on at least one
production firmware (still root-causing on the firmware side), so the
SDEI path saves the crashed context and parks instead of powering off.

The only consequence is that an SMP capture kernel can't re-online that
CPU. The dump itself is complete. I've left "power the SDEI-stopped CPU
off too" as a follow-up and called it out in the cover letter. The IPI
crash path is unaffected and still does CPU_OFF, exactly as before.

> Perhaps when doing that you'd stop unconditionally getting the cpu in
> do_handle_IPI() and just call it for `IPI_KGDB_ROUNDUP` since it would
> now be the only consumer of that local variable.

I kept the local cpu — after the change it's still used by the
IPI_KGDB_ROUNDUP case and the default: pr_crit(), so it didn't become
single-use.

> > @@ -1263,6 +1293,29 @@ void smp_send_stop(void)
> > udelay(1);
> > }
> >
> > + /*
> > + * If CPUs are *still* online, try the SDEI cross-CPU NMI. Firmware
> > + * delivers it regardless of the target's DAIF state, so it reaches
> > + * a CPU spinning with interrupts masked, which neither rung above
> > + * could (without pseudo-NMI there is no NMI rung at all). Allow
> > + * 100ms: a firmware round-trip per CPU, with headroom.
> > + */
> > + if (num_other_online_cpus()) {
> > + /* re-snapshot after the rungs above took CPUs offline */
> > + smp_rmb();
> > + cpumask_copy(&mask, cpu_online_mask);
> > + cpumask_clear_cpu(smp_processor_id(), &mask);
> > +
> > + if (sdei_nmi_stop_cpus(&mask)) {
> > + pr_info("SMP: retry stop with SDEI NMI for CPUs %*pbl\n",
> > + cpumask_pr_args(&mask));
>
> Perhaps it's being a bit pedantic, but it's a little weird that you're
> printing a message that sounds like "I'm going to retry with SDEI"
> after you've already done it. It feels like it would be nominally
> cleaner (and more parellel with the pseudo-NMI case) if you could
> separately test if SDEI is available. Then sdei_nmi_stop_cpus() would
> just return void?

Fixed. There's now a sdei_nmi_active() predicate; the rung tests it,
prints, then calls sdei_nmi_stop_cpus(), which is now void. It mirrors
the pseudo-NMI rung's check-then-act shape.

>
>
> > @@ -59,8 +64,45 @@ static bool sdei_nmi_available;
> >
> > #define SDEI_NMI_EVENT 0
> >
> > +/*
> > + * Stop-request dispatch lives on the same SDEI event 0 as everything
> > + * else. The requesting CPU sets each target's bit in sdei_nmi_stop_mask
> > + * before signalling event 0; the target's handler test-and-clears its
> > + * bit and hands the CPU to arm64_nmi_cpu_stop(), which saves crash
> > + * state when the stop is a kdump crash-stop, marks the CPU offline
> > + * (which is what the requester polls for) and parks it.
> > + *
> > + * This mirrors the cpumask the framework's nmi_cpu_backtrace() consults
> > + * just below, and a shared mask rather than a separate SDEI event avoids
> > + * extra registrations from firmware.
> > + */
>
> Do you have any reasoning for why you don't pick a separate EVENT ID
> for "backtrace" vs. "stop". If you absolutely have to share an ID
> because they're a limited resource then I guess it's fine, but it
> would make the code easier to understand / reason about if they were
> separate IDs.
>
> If you had a separate EVENT ID, then it seems like you could
> completely eliminate the (potentially large) `sdei_nmi_stop_mask`
> variable, right? Any time a "STOP" event fires you can unconditionally
> consider it to be a stop w/ no globals needed, right?

Separate event IDs aren't available: SDEI_EVENT_SIGNAL only ever signals
event 0 — it's the one architecturally software-signalled event. Every
other event number is an interrupt-bound event that firmware has to
define and bind, which is the firmware dependency this series is
specifically trying not to add. So backtrace and stop are stuck sharing
event 0.

But you're right that the mask should go — just not via a second event. A
stop is terminal and system-wide (sdei_nmi_stop_cpus() is only reached
from smp_send_stop(), which never returns), so once a stop is requested
every later event-0 fire is a stop too. I replaced the cpumask with a
single write-once flag the handler reads; a backtrace that races in
after a stop has begun just stops that CPU, which is fine. So the
(potentially large) variable is gone.

> > @@ -115,6 +157,35 @@ bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu)
> > return true;
> > }
> >
> > +/*
> > + * Last rung of the stop escalation in smp_send_stop() (see
> > + * arch/arm64/kernel/smp.c). The caller runs the regular stop IPI (and
> > + * the pseudo-NMI stop IPI, where available) first; @mask holds whatever
> > + * stayed online through those -- typically CPUs wedged with interrupts
> > + * masked, unreachable by an IPI. Set each target's stop-request flag and
> > + * signal event 0 at it; a target acks by marking itself offline, which
> > + * the caller polls for.
> > + *
> > + * Returns false when SDEI isn't active, so the caller can skip the wait.
> > + */
> > +bool sdei_nmi_stop_cpus(const cpumask_t *mask)
> > +{
> > + unsigned int cpu;
> > +
> > + if (!sdei_nmi_available)
> > + return false;
> > +
> > + cpumask_or(&sdei_nmi_stop_mask, &sdei_nmi_stop_mask, mask);
>
> As per above, hopefully we can get rid of `sdei_nmi_stop_mask`. ...but
> if we keep it, I'm curious why "or" and not "copy"?

It doesn't matter anymore. Mask is gone.

Thanks for the feedback! Any other comments?

--------------------------------8<-----------------------------------------------