Re: [PATCH 3/4] arm64: wire SDEI NMI into the hardlockup watchdog
From: Doug Anderson
Date: Fri Jun 05 2026 - 18:14:02 EST
Hi,
On Fri, Jun 5, 2026 at 2:12 PM Kiryl Shutsemau <kirill@xxxxxxxxxxxxx> wrote:
>
> On Fri, Jun 05, 2026 at 01:03:05PM -0700, Doug Anderson wrote:
> > Hi,
> >
> > On Wed, Jun 3, 2026 at 7:36 AM Kiryl Shutsemau <kirill@xxxxxxxxxxxxx> wrote:
> > >
> > > From: "Kiryl Shutsemau (Meta)" <kas@xxxxxxxxxx>
> > >
> > > Select HAVE_HARDLOCKUP_DETECTOR_ARCH so the framework takes its backend
> > > from this driver. A per-CPU hrtimer checks its buddy's heartbeat and
> > > signals event 0 at a stalled CPU, which runs watchdog_hardlockup_check()
> > > NMI-like.
> > >
> > > The source is chosen at boot: SDEI if firmware provides it, otherwise a
> > > perf-NMI counter (pseudo-NMI) fallback -- one image covers both.
> > >
> > > Signed-off-by: Kiryl Shutsemau (Meta) <kas@xxxxxxxxxx>
> > > ---
> > > arch/arm64/Kconfig | 1 +
> > > drivers/firmware/Kconfig | 3 +
> > > drivers/firmware/sdei_nmi.c | 247 +++++++++++++++++++++++++++++++++++-
> > > 3 files changed, 248 insertions(+), 3 deletions(-)
> >
> > I'm a little confused about this patch. We already have a buddy
> > hardlockup detector using the hrtimer, and it's even been improved
> > recently to trigger in a smaller time bound. It looks as if you're
> > duplicating bits of the perf and buddy detector here?
> >
> > I don't think you need this patch at all. The existing buddy detector
> > + patches #1 and #2 in your series should be sufficient.
>
> You're mostly right.
>
> Buddy + #2 covers the console case (the remote branch triggers the
> culprit's backtrace, which #2 makes deliverable), and #4 gets the wedged
> CPU's registers into the vmcore.
>
> The one thing this patch adds that a config can't is boot-time source
> selection: PERF-compiled kernels have no detector on a pseudo_nmi=0
> boot, and PREFER_BUDDY costs the pseudo-NMI machines perf
> self-detection. But that's arguably out of scope for the patchset.
>
> I'll drop this patch in v2 and run PREFER_BUDDY here. If a runtime
> perf->buddy fallback ever materializes, the gap closes entirely.
Sure. If you're interested in trying to make perf vs. buddy coexist,
that should be done in a platform-agnostic way. Feel free to post
patches for that. I know we discussed this previously. Ah, here they
are:

https://lore.kernel.org/r/20250916145122.416128-1-wangjinchao600@xxxxxxxxx

I think those got bikeshedded to death and nobody cared enough to keep pushing.
FWIW, my belief is that the buddy detector is superior in every way
except that it can't detect when all CPUs lock up simultaneously.

...though I wonder if a nicer way to solve the "all CPUs locked up"
case is to just NMI-enable the "bark" interrupt of a hardware watchdog
timer. That ought to be quite easy...
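
To illustrate the blind spot mentioned above, here is a minimal
user-space sketch of a buddy-style check. It is a simulation under
simplifying assumptions, not the kernel's implementation: the names
sim_cpu, sim_tick and sim_run are hypothetical, and a stall is reported
after a single missed heartbeat rather than after several.

```c
#include <stdbool.h>
#include <stddef.h>

#define NCPUS 4

/* Hypothetical per-CPU state for this simulation only. */
struct sim_cpu {
	bool stuck;		/* a stuck CPU no longer runs its timer */
	unsigned long beat;	/* heartbeat, bumped by the CPU's own timer */
	unsigned long seen;	/* heartbeat value last observed by the checker */
};

/*
 * One timer period: every live CPU bumps its own heartbeat, then checks
 * whether the next CPU's heartbeat has moved since the last check.
 * Returns how many stalled CPUs were reported in this period.
 */
static int sim_tick(struct sim_cpu *cpus)
{
	int reported = 0;

	for (size_t i = 0; i < NCPUS; i++)
		if (!cpus[i].stuck)
			cpus[i].beat++;

	for (size_t i = 0; i < NCPUS; i++) {
		struct sim_cpu *buddy = &cpus[(i + 1) % NCPUS];

		if (cpus[i].stuck)
			continue;	/* nobody is running the check here */
		if (buddy->beat == buddy->seen)
			reported++;
		buddy->seen = buddy->beat;
	}
	return reported;
}

/* Run one healthy period, then wedge the first nstuck CPUs and tick again. */
static int sim_run(int nstuck)
{
	struct sim_cpu cpus[NCPUS] = { 0 };

	sim_tick(cpus);
	for (int i = 0; i < nstuck && i < NCPUS; i++)
		cpus[i].stuck = true;
	return sim_tick(cpus);
}
```

With one wedged CPU its checker still runs and reports it; with all CPUs
wedged no check runs at all, so nothing is reported. That is the case an
independent source such as a hardware watchdog would have to cover.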
-Doug