Re: [PATCH] arm64: smp: smp_send_stop() and crash_smp_send_stop() should try non-NMI first

From: Doug Anderson
Date: Thu Feb 29 2024 - 13:34:55 EST


Hi,

On Wed, Feb 28, 2024 at 5:11 AM Daniel Thompson
<daniel.thompson@xxxxxxxxxx> wrote:
>
> > I'm still hoping to get some sort of feedback here. If people think
> > this is a terrible idea then I'll shut up now and leave well enough
> > alone, but it would be nice to actively decide and get the patch out
> > of limbo.
>
> I've read patch through a couple of times and was generally convinced by
> the "do what x86 does" argument.
>
> However until now I've always held my council since I wasn't familiar
> with these code paths and I figured it was OK for me to have no opinion
> because the first line of the description says that kgdb/kdb is 100% not
> involved in causing the problem ;-) .
>
> However today I also took a look at the HAVE_NMI architectures and there
> is no consensus between them about how to implement this: PowerPC uses
> NMI and most of the others use IRQ only, s390 special cases for the
> panic code path and acts differently compared to a normal SMP shutdown.

Thanks for taking a look! I think I just included you since long ago
you were involved in the pseudo-NMI patches. ;-)


> FWIW the x86 route was irq-only and then switching to irq-plus-nmi
> (after a short trial with NMI-only that had problems with pstore
> reliability[1]) and that approach has been in place for over
> a decade now!

Ah, interesting. I guess this isn't a problem for me at the moment
since we're not using any alternate pstore backends (ChromeOS just
does pstore to RAM), but it's good to confirm that people were facing
real issues. This matches what my gut told me: that it's nice to give
CPUs a little chance to shut down more cleanly before jamming an NMI
down their throats.


> However, if we talking ourselves into copying x86 then perhaps we should
> more accurately copy x86! Assuming I read the x86 code correctly then
> crash_smp_send_stop() will (mostly) go staight to NMI rather
> than trialling an IRQ first! That is not what is currently implemented
> in the patch for arm64.

Sure, I'm happy to change the patch to work that way, though I might
wait to get some confirmation from a maintainer that they think this
idea is worth pursuing before spending more time on it. I don't think
it would be hard to have the "crash stop" code jump straight to NMI if
that's what people want. Matching x86 here seems reasonable, though
I'd also say that my gut still says that even for crash stop we should
try to stop things cleanly before jumping to NMI. I guess I could
imagine that the code we're kexec-ing to generate the core file might
be more likely to find the hardware in a funny state if we stopped
CPUs w/ NMI vs IRQ.


-Doug