Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

From: Steven Rostedt
Date: Wed Oct 03 2018 - 13:37:11 EST


On Wed, 3 Oct 2018 10:16:08 -0700
Daniel Wang <wonderfly@xxxxxxxxxx> wrote:

> On Wed, Oct 3, 2018 at 2:14 AM Petr Mladek <pmladek@xxxxxxxx> wrote:
> >
> > On Tue 2018-10-02 21:23:27, Steven Rostedt wrote:
> > > I don't see the big deal of backporting this. The biggest complaints
> > > about backports are from fixes that were added to late -rc releases
> > > where the fixes didn't get much testing. This commit was added in 4.16,
> > > and hasn't had any issues due to the design. Although a fix has been
> > > added:
> > >
> > > c14376de3a1 ("printk: Wake klogd when passing console_lock owner")
> >
> > As I said, I am fine with backporting the console_lock owner stuff
> > into the stable release.
> >
> > I just wonder (like Sergey) what the real problem is. The console_lock
> > owner handshake is not fully reliable. It is might be good enough

I'm not sure what you mean by 'not fully reliable'

> > to prevent softlockup. But we should not relay on it to prevent
> > a deadlock.
>
> Yes. I myself was curious too. :)
>
> >
> > My new theory ;-)
> >
> > printk_safe_flush() is called in nmi_trigger_cpumask_backtrace().
> > => watchdog_timer_fn() is blocked until all backtraces are printed.
> >
> > Now, the original report complained that the system rebooted before
> > all backtraces were printed. It means that panic() was called
> > on another CPU. My guess is that it is from the hardlockup detector.
> > And the panic() was not able to flush the console because it was
> > not able to take console_lock.
> >
> > IMHO, there was not a real deadlock. The console_lock owner
> > handshake jsut helped to get console_lock in panic() and
> > flush all messages before reboot => it is reasonable
> > and acceptable fix.

Agreed.


>
> I had the same speculation. Tried to capture a lockdep snippet with
> CONFIG_PROVE_LOCKING turned on but didn't get anything. But
> maybe I was doing it wrong.
>
> >
> > Just to be sure. Daniel, could you please send a log with
> > the console_lock owner stuff backported? There we would see
> > who called the panic() and why it rebooted early.
>
> Sure. Here is one. It's a bit long but complete. I attached another log
> snippet below it which is what I got when `softlockup_panic` was turned
> off. The log was from the IRQ task that was flushing the printk buffer. I
> will be taking a closer look at it too but in case you'll find it helpful.

Just so I understand correctly. Does the panic hit with and without the
suggested backport patch? The only difference is that you get the full
output with the patch and limited output without it?

-- Steve