Re: [PATCH] lkdtm/bugs: add test for panic() with stuck secondary CPUs
From: Sumit Garg
Date: Thu Aug 31 2023 - 09:16:57 EST
On Thu, 31 Aug 2023 at 18:38, Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> On Thu, Aug 31, 2023 at 06:15:29PM +0530, Sumit Garg wrote:
> > Hi Mark,
> >
> > Thanks for putting up a test case for this.
> >
> > On Thu, 31 Aug 2023 at 15:40, Mark Rutland <mark.rutland@xxxxxxx> wrote:
> > >
> > > Upon a panic() the kernel will use either smp_send_stop() or
> > > crash_smp_send_stop() to attempt to stop secondary CPUs via an IPI,
> > > which may or may not be an NMI. Generally it's preferable that this is an
> > > NMI so that CPUs can be stopped in as many situations as possible, but
> > > it's not always possible to provide an NMI, and there are cases where
> > > CPUs may be unable to handle the NMI regardless.
> > >
> > > This patch adds a test for panic() where all other CPUs are stuck with
> > > interrupts disabled, which can be used to check whether the kernel
> > > gracefully handles CPUs failing to respond to a stop, and whe NMIs stops
> >
> > s/whe/when/
> >
> > > work.
> > >
> > > For example, on arm64 *without* an NMI, this results in:
> > >
> > > | # echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
> > > | lkdtm: Performing direct entry PANIC_STOP_IRQOFF
> > > | Kernel panic - not syncing: panic stop irqoff test
> > > | CPU: 2 PID: 24 Comm: migration/2 Not tainted 6.5.0-rc3-00077-ge6c782389895-dirty #4
> > > | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> > > | Stopper: multi_cpu_stop+0x0/0x1a0 <- stop_machine_cpuslocked+0x158/0x1a4
> > > | Call trace:
> > > | dump_backtrace+0x94/0xec
> > > | show_stack+0x18/0x24
> > > | dump_stack_lvl+0x74/0xc0
> > > | dump_stack+0x18/0x24
> > > | panic+0x358/0x3e8
> > > | lkdtm_PANIC+0x0/0x18
> > > | multi_cpu_stop+0x9c/0x1a0
> > > | cpu_stopper_thread+0x84/0x118
> > > | smpboot_thread_fn+0x224/0x248
> > > | kthread+0x114/0x118
> > > | ret_from_fork+0x10/0x20
> > > | SMP: stopping secondary CPUs
> > > | SMP: failed to stop secondary CPUs 0-3
> > > | Kernel Offset: 0x401cf3490000 from 0xffff800080000000
> > > | PHYS_OFFSET: 0x40000000
> > > | CPU features: 0x00000000,68c167a1,cce6773f
> > > | Memory Limit: none
> > > | ---[ end Kernel panic - not syncing: panic stop irqoff test ]---
> > >
> > > On arm64 *with* an NMI, this results in:
> >
> > I suppose a more interesting test scenario to show difference among
> > NMI stop IPI and regular stop IPI would be:
> >
> > - First put any CPU into hard lockup state via:
> > $ echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
> >
> > - And then provoke following from other CPU:
> > $ echo PANIC_STOP_IRQOFF > /sys/kernel/debug/provoke-crash/DIRECT
>
> I don't follow. IIUC that's only going to test whether a HW watchdog can fire
> and reset the system?
>
> The PANIC_STOP_IRQOFF test has each CPU run panic_stop_irqoff_fn() with IRQs
> disabled, and if one CPU is stuck in the HARDLOCKUP test, we'll never get all
> CPUs into panic_stop_irqoff_fn(), and so all CPUs will be stuck with IRQs
> disabled, spinning.
>
> The PANIC_STOP_IRQOFF test itself tests the difference between an NMI stop IPI
> and regular stop IPI, as the results in the commit message show. Look for the
> line above that says:
>
> | SMP: failed to stop secondary CPUs 0-3
>
> ... which is *not* present in the NMI case (though we don't have an explicit
> "stopped all CPUs" message).

Ah, I see your point; I missed that difference when I first looked at
the panic() logs. So it's the post-panic() CPU stop behaviour that we
are testing here. Thanks for the explanation.

-Sumit