Re: [PATCH v5 0/2] printk: Console owner and waiter logic cleanup

From: Tejun Heo
Date: Wed Jan 10 2018 - 13:31:03 EST


Hello, Peter.

On Wed, Jan 10, 2018 at 07:21:53PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 10, 2018 at 09:02:23AM -0800, Tejun Heo wrote:
> > 2. System runs out of memory, OOM triggers.
> > 3. OOM handler is printing out OOM debug info.
> > 4. While trying to emit the messages for netconsole, the network stack
> > / driver tries to allocate memory and then fails, which in turn
> > triggers allocation failure or other warning messages. printk was
> > already flushing, so the messages are queued on the ring.
> > 5. OOM handler keeps flushing but 4 repeats and the queue is never
> > shrinking. Because OOM handler is trapped in printk flushing, it
> > never manages to free memory and no one else can enter OOM path
> > either, so the system is trapped in this state.
>
> Why not kill recursive OOM (msgs) ?

Sure, we can do that too, e.g. by marking the flushing thread and
ignoring new messages from it, although that does come with its own
downsides. The choices are:

* If we can make printk safe without much downside, that'd be the best
option.

* If we decide that we can't do that in a reasonable way, we sure can
try to plug the identified cases. We might have to play a bit of
whack-a-mole (e.g. the feedback loop might not necessarily be from
the same context) but there likely are very few repeatable cases.
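
To make the "mark the flushing thread and ignore new messages from it"
idea concrete, here is a minimal userspace sketch. It is not kernel
code and not a proposed patch: the ring, log_msg(), emit(), flush() and
the in_flush marker are all hypothetical names made up for
illustration, and emit() merely stands in for a console path that
fails an allocation and warns about it.

```c
#include <stdbool.h>
#include <stdio.h>

#define RING_SIZE 8
#define MSG_LEN   64

/* Hypothetical message ring: head is the next slot to write,
 * tail is the next slot to flush. */
static char ring[RING_SIZE][MSG_LEN];
static unsigned int head, tail;
static unsigned int dropped;

/* Hypothetical per-thread marker, set while this thread is flushing. */
static _Thread_local bool in_flush;

/* Queue a message, unless it comes from the thread that is flushing. */
static void log_msg(const char *msg)
{
	if (in_flush || head - tail == RING_SIZE) {
		dropped++;	/* recursive message or full ring */
		return;
	}
	snprintf(ring[head % RING_SIZE], MSG_LEN, "%s", msg);
	head++;
}

/* Stand-in for a console driver whose allocation fails while it is
 * emitting, which triggers a new warning message. */
static void emit(const char *msg)
{
	(void)msg;
	log_msg("warning: allocation failed in console path");
}

/* Drain the ring; returns the number of messages emitted. */
static unsigned int flush(void)
{
	unsigned int n = 0;

	in_flush = true;
	while (tail != head) {
		emit(ring[tail % RING_SIZE]);
		tail++;
		n++;
	}
	in_flush = false;
	return n;
}
```

With the marker in place, queueing two messages and calling flush()
terminates after emitting those two, because the warnings generated
from inside emit() are dropped instead of being queued. Without the
in_flush check the ring would be refilled as fast as it drains. The
sketch also shows the downsides: the dropped warnings are lost, and a
per-thread marker does nothing when the feedback comes from a
different context.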

I may be missing some of the history here, but as far as I can tell
the discussion hasn't really gotten to that point since I brought up
the case that we've been seeing.

Thanks.

--
tejun