Re: [PATCH v2] mm/page_isolation: fix a deadlock with printk()

From: Petr Mladek
Date: Thu Oct 10 2019 - 07:11:28 EST

Next message: Heikki Krogerus: "Re: [PATCH v2 0/2] extcon: axp288: Move to swnodes"
Previous message: Shenhar, Talel: "Re: [PATCH v5 1/2] dt-bindings: soc: al-pos: Amazon's Annapurna Labs POS"
In reply to: Sergey Senozhatsky: "Re: [PATCH v2] mm/page_isolation: fix a deadlock with printk()"
Next in thread: Qian Cai: "Re: [PATCH v2] mm/page_isolation: fix a deadlock with printk()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu 2019-10-10 17:39:08, Sergey Senozhatsky wrote:
> On (10/10/19 10:21), Petr Mladek wrote:
> [..]
> > > > Considering that console.write is called from essentially arbitrary code
> > > > path IIUC then all the locks used in this path should be pretty much
> > > > tail locks or console internal ones without external dependencies.
> > >
> > > That's a good expectation, but I guess it's not always the case.
> > >
> > > One example might be NET console - net subsystem locks, net device
> > > drivers locks, maybe even some MM locks (skb allocations?).
> > >
> > > But even more "commonly used" consoles sometimes break that
> > > expectation. E.g. 8250
> > >
> > > serial8250_console_write()
> > > serial8250_modem_status()
> > > wake_up_interruptible()
> > >
> > > And so on.
> >
> > I think that the only maintainable solution is to call the console
> > drivers in a well defined context (kthread). We have finally opened
> > doors to do this change.
>
> Yeah, that's a pretty complex thing, I suspect. Panic flush to
> netcon may deadlock if oops occurs under one of those "important
> MM locks" (if any MM locks are actually involved in ATOMIC skb
> allocation). If there are such MM locks, then I think flush_on_panic
> issue can't be address by printing kthread or ->atomic_write callback.

Sure, we could not rely on kthreads in panic(). In this situation
any lock taken from console->write() callback is a possible source
of a deadlock.

Note that I say that the locks are the problem and not printk()
called under these locks. It is because:

+ The experience shows that we could not prevent people from
using printk() anywhere.

+ printk() might get called even when it is not used explicitly.
For example, from NMI, IRQ, Oops.

So, the best solution is to avoid as many locks in console->write()
callbacks as possible. Especially this means as less dependencies
on external subsystems as possible. MM is an obvious candidate.
We should avoid calling MM not only because it uses locks but
also because there might not be any available memory.

Of course, there are better and worse console drivers. It is hard
to expect that all will or can be made safe. From the printk()
point of view, the defense against the problematic consoles
might be:

+ Classify each console driver. Flush all messages to the most
reliable consoles first and to the least reliable ones at last.

+ Prevent calling consoles when there is other way to preserve
the messages over the reboot, e.g. crashdump or permanent memory.

Of course, we will still need a way how to actively search for
problems in advance. For example, printk() could use a fake
lock even during the normal operation so that it could trigger
lockdep splat. We could enable it after all the init calls
are proceed to reduce the number of false positives.

Hmm, this discussion probably belongs to another thread about
printk() design.

Anyway, it seems that removing MM from console->write() calls
is a win-win solution.

Best Regards,
Petr

Next message: Heikki Krogerus: "Re: [PATCH v2 0/2] extcon: axp288: Move to swnodes"
Previous message: Shenhar, Talel: "Re: [PATCH v5 1/2] dt-bindings: soc: al-pos: Amazon's Annapurna Labs POS"
In reply to: Sergey Senozhatsky: "Re: [PATCH v2] mm/page_isolation: fix a deadlock with printk()"
Next in thread: Qian Cai: "Re: [PATCH v2] mm/page_isolation: fix a deadlock with printk()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]