Re: [PATCH v5 0/2] printk: Console owner and waiter logic cleanup

From: Mathieu Desnoyers
Date: Wed Jan 10 2018 - 13:40:47 EST


----- On Jan 10, 2018, at 12:02 PM, Tejun Heo tj@xxxxxxxxxx wrote:

> Hello, Linus, Andrew.
>
> On Wed, Jan 10, 2018 at 05:29:00PM +0100, Petr Mladek wrote:
>> Where is the acceptable compromise? I am not sure. So far, the most
>> forceful people (Linus) did not see softlockups as a big problem.
>> They rather wanted to see the messages.
>
> Can you please chime in? Would you be opposed to offloading to an
> independent context even if it were only for cases where we were
> already punting? The thing with the current offloading is that we
> don't know who we're offloading to. It might end up in faster or
> slower context, or more importantly a dangerous one.
>
> The particular case that we've been seeing regularly in the fleet was
> the following scenario.
>
> 1. Console is IPMI emulated serial console. Super slow. Also
> netconsole is in use.
> 2. System runs out of memory, OOM triggers.
> 3. OOM handler is printing out OOM debug info.
> 4. While trying to emit the messages for netconsole, the network stack
> / driver tries to allocate memory and then fail, which in turn
> triggers allocation failure or other warning messages. printk was
> already flushing, so the messages are queued on the ring.
> 5. OOM handler keeps flushing but 4 repeats and the queue is never
> shrinking. Because OOM handler is trapped in printk flushing, it
> never manages to free memory and no one else can enter OOM path
> either, so the system is trapped in this state.

Hi Tejun,

There appears to be two problems at hand. One is making sure a console
buffer owner only flushes a bounded amount of data. Steven&Co patches
seem to address this.

The second problem you describe here appears to be related to the
side-effects of console drivers, namely netconsole in this scenario.
Its use of the network stack can allocate memory, which can fail, and
therefore trigger more printk. Having a way to detect that code is
directly called from a printk driver, and making sure error handling
is _not_ done by pushing more printk messages to that printk driver in
those situations comes to mind as a possible solution.

The problem you describe seems to be _another_ issue of the current
printk implementation which Steven's approach does not address, but
I don't think that Steven's changes prevent doing further improvements
on the netconsole driver front.

I also don't see what's wrong in the incremental approach proposed by
Steven. Even though it does not fix your console driver problem, his
patchset appears to address some real-world latency issues.

Thanks,

Mathieu

>
> The system usually never recovers in time once this sort of condition
> hits and the following was the patch that I suggested which only punts
> when messages are already being punted and we can easily make it less
> punty by delaying the punting by N messages.
>
> http://lkml.kernel.org/r/20171102135258.GO3252168@xxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> We definitely can fix the above described case by e.g. preventing
> printk flushing task from queueing more messages or whatever, but it
> just seems really dumb for the system to die from things like this in
> general and it doesn't really take all that much to trigger the
> condition.
>
> Thanks.
>
> --
> tejun

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com