Re: 4.14 backport request for dbdda842fe96f: "printk: Add console owner and waiter logic to load balance console writes"

From: Petr Mladek
Date: Wed Oct 03 2018 - 05:14:07 EST


On Tue 2018-10-02 21:23:27, Steven Rostedt wrote:
> On Tue, 2 Oct 2018 17:15:17 -0700
> Daniel Wang <wonderfly@xxxxxxxxxx> wrote:
>
> > On Tue, Oct 2, 2018 at 1:42 AM Petr Mladek <pmladek@xxxxxxxx> wrote:
> > >
> > > Well, I still wonder why it helped and why you do not see it with 4.4.
> > > I have a feeling that the console owner switch helped only by chance.
> > > In fact, you might be affected by a race in
> > > printk_safe_flush_on_panic() that was fixed by the commit:
> > >
> > > 554755be08fba31c7 printk: drop in_nmi check from printk_safe_flush_on_panic()
> > >
> > > The above one commit might be enough. Well, there was one more
> > > NMI-related race that was fixed by:
> > >
> > > ba552399954dde1b printk: Split the code for storing a message into the log buffer
> > > a338f84dc196f44b printk: Create helper function to queue deferred console handling
> > > 03fc7f9c99c1e7ae printk/nmi: Prevent deadlock when accessing the main log buffer in NMI
> >
> > All of these commits already exist in 4.14 stable, since 4.14.68. The deadlock
> > still exists even when built from 4.14.73 (latest tag) though. And cherrypicking
> > dbdda842fe96 fixes it.
> >
>
> I don't see the big deal of backporting this. The biggest complaints
> about backports are from fixes that were added to late -rc releases
> where the fixes didn't get much testing. This commit was added in 4.16,
> and hasn't had any issues due to the design. Although a fix has been
> added:
>
> c14376de3a1 ("printk: Wake klogd when passing console_lock owner")

As I said, I am fine with backporting the console_lock owner stuff
into the stable release.

I just wonder (like Sergey) what the real problem is. The console_lock
owner handshake is not fully reliable. It is might be good enough
to prevent softlockup. But we should not relay on it to prevent
a deadlock.

My new theory ;-)

printk_safe_flush() is called in nmi_trigger_cpumask_backtrace().
=> watchdog_timer_fn() is blocked until all backtraces are printed.

Now, the original report complained that the system rebooted before
all backtraces were printed. It means that panic() was called
on another CPU. My guess is that it is from the hardlockup detector.
And the panic() was not able to flush the console because it was
not able to take console_lock.

IMHO, there was not a real deadlock. The console_lock owner
handshake jsut helped to get console_lock in panic() and
flush all messages before reboot => it is reasonable
and acceptable fix.

Just to be sure. Daniel, could you please send a log with
the console_lock owner stuff backported? There we would see
who called the panic() and why it rebooted early.

Best Regards,
Petr