Re: [PATCH] lockdep: Avoid triggering hardlockup from debug_show_all_locks()

From: Tejun Heo
Date: Wed Jan 24 2018 - 14:10:59 EST


Hello, Sergey, Steven.

On Wed, Jan 24, 2018 at 02:00:35PM +0900, Sergey Senozhatsky wrote:
> On (01/23/18 21:54), Steven Rostedt wrote:
> >
> > > Another problem, and I mentioned it somewhere in another email, is that
> > > upstream printk people don't receive enough [if any at all] feedback from
> > > guys who face printk issues. That's why every time printk_kthread re-surfaces
> > > the reaction is "this is not a real problem, no one is seeing printk issues
> > > like these, you idiot!". It'd be great to have more "we need ABC, because of
> > > XYZ, but printk crashes the system. Here is the backtrace, fix it" reports.
> > > As of now, those things mostly are not reported, that's why people are not
> > > convinced. Just my 5 cents.
> >
> > If you are seeing these issues, have whoever is reporting them to Cc
> > LKML, and those of us that would care to listen.
>
> OK. The lack of reports can also signify that we need to change the way we
> handle those reports. If we are going to reply "yes, your system crashed
> while doing completely innocent printout, but if we fix it then we can
> increase by 0.0001% chances of not getting any printouts at all in that
> corner case when your system is in recursive double panic over the latest
> bitcoin price and your keyboard is on fire" then I don't think people will
> care to report anything.
>
> Maybe Tejun will be kind enough to shed some light on how often FB fleet
> suffer from printk related issues, or what are the most common scenarios,
> etc. [sensitive information can be reported "off list"]

There are efforts to automatically scrub and share kernel splats
publicly, so hopefully we'd be able to provide better visibility
into the problems we encounter in the future. For now, there are some
security implications and I'm not very sure how liberally I can share.

In terms of frequency, it isn't catastrophic. I think Chris Mason
described it pretty eloquently - "more often than hardware failures,
not so often that we turn off serial console". One painful part is
that it adds noise to signal by escalating an unrelated problem to RCU
stalls or hard lockups just by occupying an unlucky context for too
long.

The relevance of these messages falls pretty rapidly as the number of
consecutive lines increases, so it is frustrating to pay for them this
way. After 15s of flushing along with a number of "X printk messages
dropped" warnings, we aren't really doing anyone any service by trying
to uphold "transmit the message as close to the printking context as
possible".

Thanks.

--
tejun