Re: [RFC PATCH] panic: fix deadlock in panic()
From: Petr Mladek
Date: Thu Jun 04 2020 - 04:29:52 EST
On Wed 2020-06-03 14:19:15, Cheng Jian wrote:
> A deadlock caused by logbuf_lock occurs when panic:
>
> a) Panic CPU is running in non-NMI context
> b) Panic CPU sends out shutdown IPI via NMI vector
> c) One of the CPUs that we bring down via NMI vector holded logbuf_lock
> d) Panic CPU try to hold logbuf_lock, then deadlock occurs.
>
> we try to re-init the logbuf_lock in printk_safe_flush_on_panic()
> to avoid deadlock, but it does not work here, because :
>
> Firstly, it is inappropriate to check num_online_cpus() here.
> When the CPU bring down via NMI vector, the panic CPU willn't
> wait too long for other cores to stop, so when this problem
> occurs, num_online_cpus() may be greater than 1.
>
> Secondly, printk_safe_flush_on_panic() is called after panic
> notifier callback, so if printk() is called in panic notifier
> callback, deadlock will still occurs. Eg, if ftrace_dump_on_oops
> is set, we print some debug information, it will try to hold the
> logbuf_lock.
>
> To avoid this deadlock, drop the num_online_cpus() check and call
> the printk_safe_flush_on_panic() before panic_notifier_list callback,
> attempt to re-init logbuf_lock from panic CPU.
It might cause double unlock (deadlock) on architectures that did not
use NMI to stop the CPUs.
I have created a conservative fix for this problem for SLES, see
https://github.com/openSUSE/kernel-source/blob/SLE15-SP2-UPDATE/patches.suse/printk-panic-Avoid-deadlock-in-printk-after-stopping-CPUs-by-NMI.patch
It solves the problem only on x86 architecture.
There are many hacks that try to solve various scenarios but it
is getting too complicated and does not solve all problems.
The only real solution is lockless printk(). First piece is a lockless
ringbuffer. See the last version at
https://lore.kernel.org/r/20200501094010.17694-1-john.ogness@xxxxxxxxxxxxx
We prefer to work on the lockless solution instead of adding more
complicated workarounds. This is why I even did not try to upstream
the patch for SLES.
In the meantime, you might also consider removing the offending
message from the panic notifier if it is not really important.
Best Regards,
Petr