Re: [RFC PATCH] nmi,printk: fix ABBA deadlock between nmi_backtrace and dump_stack_lvl

From: Petr Mladek
Date: Wed Jul 24 2024 - 11:08:37 EST


On Wed 2024-07-24 16:51:46, John Ogness wrote:
> On 2024-07-24, Petr Mladek <pmladek@xxxxxxxx> wrote:
> > My guess is that it looked like:
> >
> > CPU A                                 CPU B
> >
> >                                       printk()
> >                                         console_try_lock_spinning()
> >                                         console_unlock()
> >                                           console_emit_next_record()
> >                                             console_lock_spinning_enable();
> >                                             con->write()
> >                                               spin_lock(port->lock);
> >
> > printk_cpu_sync_get()
> >   printk()
> >     console_try_lock_spinning()
> >       # spinning and waiting for CPU B
> >
> >                                       NMI:
> >
> >                                         printk_cpu_sync_get()
> >                                           # waiting for CPU A
> >
> > => DEADLOCK
> >
> >
> > The deadlock is caused under/by printk_cpu_sync_get() but only because
> > console_try_lock_spinning() is blocked. It is not a true "try_lock"
> > operation which should never get blocked.
> >
> > => The above patch should solve the problem as well. It will cause
> > console_try_lock_spinning() to fail immediately on CPU A.
> >
> > Note that port->lock can't cause any deadlock in this scenario.
> > console_try_lock_spinning() will always fail on CPU A until
> > the NMI gets handled on CPU B.
> >
> > In other words, printk_cpu_sync_get() will behave as a tail lock
> > on CPU A because of the failing trylock.
>
> But only in _this_ scenario. The port lock could be taken by CPU B for
> non-console-printing reasons. Then you still have deadlock, due to
> spinning on the port lock.

I see. I agree that deferring printk on that CPU [0] is the right solution.
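
To keep the four cases apart, here is a toy userspace model of who waits for
whom. It is only a sketch, not kernel code: the enum names and both helpers
are made up for illustration, and it assumes that CPU B is always stuck in
the NMI waiting for the printk_cpu_sync owner (CPU A).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: each CPU waits for at most one other CPU. */
enum { CPU_A, CPU_B, NR_CPUS, NOBODY = -1 };

/* What CPU A does in printk() while it owns printk_cpu_sync (made-up names). */
enum cpu_a_action {
	A_SPINS_FOR_CONSOLE_OWNER,	/* current console_try_lock_spinning() */
	A_TRYLOCK_FAILS,		/* trylock fails immediately, CPU B was printing */
	A_SPINS_ON_PORT_LOCK,		/* CPU B took port->lock for a non-printing reason */
	A_DEFERS_PRINTK,		/* printk deferred on CPU A, no console work there */
};

/* Follow the "waits for" chain; getting back to @start means deadlock. */
static bool deadlocked(const int waits_for[NR_CPUS], int start)
{
	int cpu = waits_for[start];
	int steps;

	for (steps = 0; steps < NR_CPUS && cpu != NOBODY; steps++) {
		if (cpu == start)
			return true;
		assert(cpu >= 0 && cpu < NR_CPUS);
		cpu = waits_for[cpu];
	}
	return false;
}

static bool scenario_deadlocks(enum cpu_a_action action)
{
	int waits_for[NR_CPUS];

	/* NMI on CPU B: printk_cpu_sync_get() waits for the owner, CPU A. */
	waits_for[CPU_B] = CPU_A;

	switch (action) {
	case A_SPINS_FOR_CONSOLE_OWNER:
	case A_SPINS_ON_PORT_LOCK:
		/* CPU A spins on something that only CPU B can release. */
		waits_for[CPU_A] = CPU_B;
		break;
	default:
		/* CPU A does not block and eventually releases printk_cpu_sync. */
		waits_for[CPU_A] = NOBODY;
		break;
	}
	return deadlocked(waits_for, CPU_A);
}
```

In this model, scenario_deadlocks() reports a cycle for both spinning cases
and none for A_TRYLOCK_FAILS and A_DEFERS_PRINTK. The failing trylock helps
only when CPU B holds port->lock because it is printing; deferring covers
the A_SPINS_ON_PORT_LOCK case as well.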

> [0] https://lore.kernel.org/lkml/87plrcqyii.fsf@xxxxxxxxxxxxxxxxxxxxx

Best Regards,
Petr