Re: [PATCH v3] printk: fix deadlock when kernel panic
From: Sergey Senozhatsky
Date: Wed Feb 10 2021 - 01:35:20 EST
On (21/02/10 11:48), Muchun Song wrote:
> printk_safe_flush_on_panic() caused the following deadlock on our
> server:
>
> CPU0:                                         CPU1:
> panic                                         rcu_dump_cpu_stacks
>   kdump_nmi_shootdown_cpus                      nmi_trigger_cpumask_backtrace
>     register_nmi_handler(crash_nmi_callback)      printk_safe_flush
>                                                     __printk_safe_flush
>                                                       raw_spin_lock_irqsave(&read_lock)
>     // send NMI to other processors
>     apic_send_IPI_allbutself(NMI_VECTOR)
>                                                         // NMI interrupt, dead loop
>                                                         crash_nmi_callback
>   printk_safe_flush_on_panic
>     printk_safe_flush
>       __printk_safe_flush
>         // deadlock
>         raw_spin_lock_irqsave(&read_lock)
>
> DEADLOCK: read_lock is taken on CPU1 and will never get released.
>
> It happens when panic() stops a CPU by NMI while it has been in
> the middle of printk_safe_flush().
>
> Handle the lock the same way as logbuf_lock. The printk_safe buffers
> are flushed only when both locks can be safely taken. This avoids
> the deadlock _in this particular case_ at the expense of losing the
> contents of the printk_safe buffers.
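
For anyone following along, the "flush only when both locks can be safely
taken" rule can be modelled in userspace roughly as below. This is only a
sketch, not the patch itself: struct model_lock, the *_model names and
flushes_done are hypothetical stand-ins for logbuf_lock, read_lock and the
real flush path, and the sketch deliberately does no lock re-initialisation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a kernel raw spinlock. */
struct model_lock {
	atomic_bool held;
};

static struct model_lock logbuf_lock_model;	/* models logbuf_lock */
static struct model_lock read_lock_model;	/* models read_lock */
static int flushes_done;			/* counts completed flushes */

static void model_lock_acquire(struct model_lock *l)
{
	bool expected = false;

	assert(l != NULL);
	/* Spin until we flip the flag from false to true. */
	while (!atomic_compare_exchange_weak(&l->held, &expected, true))
		expected = false;
}

static void model_lock_release(struct model_lock *l)
{
	assert(l != NULL);
	atomic_store(&l->held, false);
}

static bool model_lock_is_held(struct model_lock *l)
{
	assert(l != NULL);
	return atomic_load(&l->held);
}

/* Models the normal flush: it takes read_lock unconditionally. */
static void flush_model(void)
{
	model_lock_acquire(&read_lock_model);
	flushes_done++;
	model_lock_release(&read_lock_model);
}

/*
 * Models the panic path: if either lock is held, its owner may have been
 * stopped by NMI and will never release it, so skip the flush (and lose
 * the buffered messages) instead of spinning forever.
 */
static bool flush_on_panic_model(void)
{
	if (model_lock_is_held(&logbuf_lock_model))
		return false;
	if (model_lock_is_held(&read_lock_model))
		return false;

	flush_model();
	return true;
}
```

With read_lock_model held by a "stopped" owner, flush_on_panic_model()
returns false instead of hanging. Once the lock is free, it flushes as usual.
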
>
> Note: It would actually be safe to re-init the locks when all CPUs were
> stopped by NMI. But it would require passing this information
> from arch-specific code. It is not worth the complexity.
> Especially because logbuf_lock and printk_safe buffers have been
> obsoleted by the lockless ring buffer.
>
> Fixes: cf9b1106c81c ("printk/nmi: flush NMI messages on the system panic")
> Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> Reviewed-by: Petr Mladek <pmladek@xxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@xxxxxxxxx>
-ss