Re: [RFC PATCH 00/11] printk: safe printing in NMI context

From: Paul E. McKenney
Date: Wed Jun 18 2014 - 10:36:26 EST


On Wed, Jun 18, 2014 at 01:03:05PM +0200, Jiri Kosina wrote:
> On Tue, 10 Jun 2014, Linus Torvalds wrote:
>
> > > Let's be crazy and Cc Linus on that.
> >
> > Quite frankly, I hate seeing something like this:
> >
> > kernel/printk/printk.c | 1218 +++++++++++++++++++++++++----------
> >
> > for something that is stupid and broken. Printing from NMI context
> > isn't really supposed to work, and we all *know* it's not supposed to
> > work.
> >
> > I'd much rather disallow it, and if there is one or two places that
> > really want to print a warning and know that they are in NMI context,
> > have a special workaround just for them, with something that does
> > *not* try to make printk in general work any better.
> >
> > Dammit, NMI context is special. I absolutely refuse to buy into the
> > broken concept that we should make more stuff work in NMI context.
> > Hell no, we should *not* try to make more crap work in NMI. NMI people
> > should be careful.
> >
> > Make a trivial "printk_nmi()" wrapper that tries to do a trylock on
> > logbuf_lock, and *maybe* the existing sequence of
> >
> > 	if (console_trylock_for_printk())
> > 		console_unlock();
> >
> > then works for actually triggering the printout. But the wrapper
> > should be 15 lines of code for "if possible, try to print things", and
> > *not* a thousand lines of changes.
>
> Alright, so this went silent again without any real progress. Is everyone
> hoping this gets sorted out at kernel summit, or ... ?
>
> Let me sum up the current situation:
>
> - both the RCU stall detector and 'echo l > sysrq-trigger' can (and we've
> seen it happening for real) cause a complete, undebuggable, silent hang
> of the machine (deadlock in NMI context)

I could easily add an option to RCU to allow people to tell it not to
use NMIs to dump the stack. Would that help?

Thanx, Paul

> - before 7ff9554bb578 and friends, this was trivial to fix more or less
> exactly the way Linus is proposing above. We've been carrying the
> fix in our kernels for a while already [1]. With printk() having got
> overly complicated recently, the "in principle trivial" fix turns
> into a crazy mess due to handling of all the indexes, sequence numbers,
> etc.
>
> - printk() from NMI is actually useful in rare cases (such as inter-CPU
> stack dumping)
>
> - using lockless buffers that trace_printk() is using has its own
> problems, as described by Petr elsewhere in this thread
>
>
> I find it rather outrageous that fixing *real bugs* (leading to hangs)
> becomes impossible due to printk() being too complex. It's very
> unfortunate that the same level of pushback didn't happen when new
> features (that actually *made* it so complicated) have been pushed; that
> would be much more valuable and appropriate.
>
> I believe Jan Kara is in the same situation with his softlockup fixes for
> printk. Those are real problems, as they are bringing machines down, and
> after two years, still not fixed, because "printk() code is scary enough
> as-is".
>
> [1] http://kernel.suse.com/cgit/kernel/commit/?h=SLE11-SP3&id=8d62ae68ff61d77ae3c4899f05dbd9c9742b14c9
>
> --
> Jiri Kosina
> SUSE Labs
>
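
For readers who want the "if possible, try to print things" idea quoted
above in concrete form, here is a minimal user-space sketch of the
trylock pattern. It is an analogy under stated assumptions, not kernel
code and not the patch under discussion: a pthread mutex stands in for
logbuf_lock, and log_buf, log_dropped and log_try_append() are
hypothetical names invented for illustration. It does not model the
console_trylock_for_printk()/console_unlock() step, and pthread calls
are not async-signal-safe, so it only illustrates the control flow.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define LOG_BUF_SIZE 256

/* Hypothetical stand-ins for the kernel log buffer and logbuf_lock. */
static char log_buf[LOG_BUF_SIZE];
static size_t log_len;
static atomic_ulong log_dropped;
static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Best-effort append that never blocks. Returns true if the message
 * was stored, false if it was dropped because the lock was busy or
 * the buffer was full.
 */
static bool log_try_append(const char *msg)
{
	size_t len = strlen(msg);
	bool stored = false;

	/* Lock is busy: give up instead of waiting, and count the loss. */
	if (pthread_mutex_trylock(&log_lock) != 0) {
		atomic_fetch_add(&log_dropped, 1);
		return false;
	}

	if (len <= LOG_BUF_SIZE - log_len) {
		memcpy(log_buf + log_len, msg, len);
		log_len += len;
		stored = true;
	} else {
		atomic_fetch_add(&log_dropped, 1);
	}

	pthread_mutex_unlock(&log_lock);
	return stored;
}
```

The design choice is that a caller who cannot take the lock loses the
message rather than risking a deadlock; the dropped counter at least
makes that loss visible afterwards.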
