Re: [PATCH v10 1/2] printk: Make printk() completely async

From: Petr Mladek
Date: Thu Sep 01 2016 - 04:58:58 EST


On Wed 2016-08-31 21:52:24, Sergey Senozhatsky wrote:
> On (08/31/16 11:38), Petr Mladek wrote:
> > 2. Potential deadlocks when calling wake_up_process() by
> > async printk and console_unlock().
>
> * there are many reasons to those recursive printk() calls -- some
> can be addressed, some cannot. for instance, it doesn't matter how many
> per-CPU buffers we use for alternative printk() once the logbuf_lock is
> corrupted.

Yup and BTW: Peter Zijlstra wants to avoid zapping locks whenever
possible because it corrupts the state. It might solve the actual
state but it might cause deadlock by the double unlock.

> another `deadlock' example would be:
>
> SyS_ioctl
> do_vfs_ioctl
> tty_ioctl
> n_tty_ioctl
> tty_mode_ioctl
> set_termios
> tty_set_termios
> uart_set_termios
> uart_change_speed
> FOO_serial_set_termios
> spin_lock_irqsave(&port->lock) // lock the output port
> ....
> !! WARN() or pr_err() or printk()
> vprintk_emit()
> /* console_trylock() */
> console_unlock()
> call_console_drivers()
> FOO_write()
> spin_lock_irqsave(&port->lock) // already
> locked

Great catch! From the already mentioned solutions, I would prefer
using deferred variants of WARN()/BUG()/printk() on these locations.
Together with using lockdep to find these locations.

Also there is the Peter Zijlstra's idea of using a lockless
"early" console to debug the situations where it happens.
It might make sense to make such a console easy to use.

I am unable to find any other generic solution that would prevent this
from the printk() side at the moment.

> 5. not 100% guaranteed printing on panic
> not entirely related to printk(), but to console output mechanism in
> general. we have console_flush_on_panic() which ignores console semaphore
> state, to increase our chances of seeing the backtrace. however, there are
> more that just one lock involved: logbuf_lock, serial driver locks. so we may
> start zap_locks() in console_flush_on_panic() to re-init the logbuf_lock,
> but underlying serial driver's locks are still in unclear state. most of
> the drivers (if not all of them) take the port->lock under disabled IRQs,
> so if panic-CPU is not the one that holds the port->lock then the port->lock
> owner CPU will probably unlock the spin_lock before processing its STOP_IPI.
> if it's the port->lock CPU that panic() the system (nmi_panic() or BUG())
> then things can be bad.

That might be very hard to solve in general as well. Again the PeterZ's
idea with the lockless console might help here.

> > I wonder how to separate the problems and make them more manageable.
>
> so I was thinking for a moment about doing the recursion detection rework
> before the async_printk. just because better recursion detection is a nice
> thing to have in the first place and it probably may help us catching some
> of the surprises that async_printk might have. but it probably will be more
> problematic than I thought.
>
> then async_printk. I have a refreshed series on my hands, addressing
> Viresh's reports. it certainly makes things better, but it doesn't
> eliminate all of the lockups/etc sources.

We must separate historical possible lockups and new regressions.
Only regressions should block the async printk series. Old
bugs should be fixed separately to keep the series manageable.

Anyway, I think that the async printk will make sense even
when we solve all the other issues. If async printk does not
cause regressions, why not make it in.


> a console_unlock() doing
> wake_up_process(printk_kthread) would make it better.

I am not sure what you mean by this.

Thanks for working on it.

Best Regards,
Petr