Re: [RFC Patch v1 0/4] arm64: Introduce new IPI as IPI_CALL_NMI_FUNC

From: Doug Anderson
Date: Fri Apr 24 2020 - 16:50:06 EST


Hi,

On Fri, Apr 24, 2020 at 4:11 AM Sumit Garg <sumit.garg@xxxxxxxxxx> wrote:
>
> With pseudo NMI support available it's possible to configure SGIs to be
> triggered as pseudo NMIs running in NMI context. And kernel features
> such as kgdb rely on NMI support to round up CPUs which are stuck in
> hard lockup state with interrupts disabled.
>
> This patch-set adds support for IPI_CALL_NMI_FUNC which can be triggered
> as a pseudo NMI, which in turn is leveraged by kgdb to round up CPUs.
>
> After this patch-set we should be able to get a backtrace for a CPU
> stuck in HARDLOCKUP. Have a look at an example below from a testcase run
> on Developerbox:
>
> $ echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
>
> # Enter kdb via Magic SysRq
>
> [11]kdb> btc
> btc: cpu status: Currently on cpu 11
> Available cpus: 0-10(I), 11, 12(I), 13, 14-23(I)
> <snip>
> Stack traceback for pid 623
> 0xffff00086a644600 623 622 1 13 R 0xffff00086a644fc0 bash
> CPU: 13 PID: 623 Comm: bash Not tainted 5.7.0-rc2 #27
> Hardware name: Socionext SynQuacer E-series DeveloperBox, BIOS build #73 Apr 6 2020
> Call trace:
> dump_backtrace+0x0/0x198
> show_stack+0x18/0x28
> dump_stack+0xb8/0x100
> kgdb_cpu_enter+0x5c0/0x5f8
> kgdb_nmicallback+0xa0/0xa8
> handle_IPI+0x190/0x200
> gic_handle_irq+0x2b8/0x2d8
> el1_irq+0xcc/0x180
> lkdtm_HARDLOCKUP+0x8/0x18
> direct_entry+0x124/0x1c0
> full_proxy_write+0x60/0xb0
> __vfs_write+0x1c/0x48
> vfs_write+0xe4/0x1d0
> ksys_write+0x6c/0xf8
> __arm64_sys_write+0x1c/0x28
> el0_svc_common.constprop.0+0x74/0x1f0
> do_el0_svc+0x24/0x90
> el0_sync_handler+0x178/0x2b8
> el0_sync+0x158/0x180
> <snip>
>
> Looking forward to your comments/feedback.
>
> Sumit Garg (4):
> arm64: smp: Introduce a new IPI as IPI_CALL_NMI_FUNC
> irqchip/gic-v3: Add support to handle SGI as pseudo NMI
> irqchip/gic-v3: Enable arch specific IPI as pseudo NMI
> arm64: kgdb: Round up cpus using IPI_CALL_NMI_FUNC
>
> arch/arm64/include/asm/hardirq.h | 2 +-
> arch/arm64/include/asm/smp.h | 1 +
> arch/arm64/kernel/kgdb.c | 15 +++++++++++++++
> arch/arm64/kernel/smp.c | 36 +++++++++++++++++++++++++++++++++++-
> drivers/irqchip/irq-gic-v3.c | 36 +++++++++++++++++++++++++++++++-----
> 5 files changed, 83 insertions(+), 7 deletions(-)

This is amazing! I:

* picked your patches back to my current 5.4 tree
* turned on "CONFIG_ARM64_PSEUDO_NMI"
* set the "irqchip.gicv3_pseudo_nmi=1" command line

...and bam, I can get a backtrace on the locked-up CPU instead of being
left in the dark.
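
For anyone reproducing this, here is a minimal sketch (Python, not part
of the patch set) of checking the two prerequisites listed above. The
function names are hypothetical, and the parsing is deliberately
simplified: it only recognizes the literal "irqchip.gicv3_pseudo_nmi=1"
spelling used here and a plain "CONFIG_ARM64_PSEUDO_NMI=y" config line,
which is an assumption about how the options were written rather than a
model of how the kernel itself parses them.

```python
def pseudo_nmi_requested(cmdline: str) -> bool:
    """Return True if the command line contains irqchip.gicv3_pseudo_nmi=1.

    Simplified on purpose: only the exact "=1" spelling is recognized.
    """
    return "irqchip.gicv3_pseudo_nmi=1" in cmdline.split()


def config_enabled(config_text: str,
                   symbol: str = "CONFIG_ARM64_PSEUDO_NMI") -> bool:
    """Return True if the kernel config text has '<symbol>=y' on some line."""
    wanted = f"{symbol}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())
```

On a running system the first function could be fed the contents of
/proc/cmdline. The second needs the kernel config text, for example
from /proc/config.gz if the kernel happens to expose it.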

I'm not sure I'm going to be of much use in actually reviewing the
code, since I'm not really an expert on how SGIs work (it took me a
while to realize that it must stand for software-generated
interrupts) or on the bowels of the GIC. I tried to do what little
review I could.

In any case, I'll keep this in my local patch stack for now and keep
testing it to make sure I don't notice any weird problems.

-Doug