Re: [PATCH 0/4] printk: Softlockup avoidance

From: Jan Kara
Date: Tue Sep 22 2015 - 06:10:11 EST


On Fri 18-09-15 15:14:11, Andrew Morton wrote:
> On Wed, 19 Aug 2015 17:38:27 +0200 Jan Kara <jack@xxxxxxxx> wrote:
>
> > From: Jan Kara <jack@xxxxxxx>
> >
> > Hello,
> >
> > Since there have recently been several attempts at dealing with softlockups
> > due to heavy printk traffic [1] [2], and I've also been privately pinged by
> > a couple of people about the state of the patch set, I've decided to respin
> > the patch set.
> >
> > To remind the original problem:
> >
> > Currently, console_unlock() prints messages from the kernel printk buffer to
> > the console for as long as the buffer is non-empty. When a serial console is
> > attached, printing is slow, so other CPUs in the system have plenty of time
> > to append new messages to the buffer while one CPU is printing. Thus that
> > CPU can spend an unbounded amount of time printing in console_unlock().
> > This is especially serious when printk() gets called under some critical
> > spinlock or with interrupts disabled.
> >
> > In practice users have observed a CPU can spend tens of seconds printing
> > in console_unlock() (usually during boot when hundreds of SCSI devices
> > are discovered) resulting in RCU stalls (CPU doing printing doesn't
> > reach quiescent state for a long time), softlockup reports (IPIs for the
> > printing CPU don't get served and thus other CPUs are spinning waiting
> > for the printing CPU to process IPIs), and eventually a machine death
> > (as messages from stalls and lockups append to printk buffer faster than
> > we are able to print). So these machines are unable to boot with serial
> > console attached. Also during artificial stress testing SATA disk
> > disappears from the system because its interrupts aren't served for too
> > long.
> >
> > This series addresses the problem in the following way: If CPU has printed
> > more than printk_offload (defaults to 1000) characters, it wakes up one
> > of dedicated printk kthreads (we don't use workqueue because that has
> > deadlock potential if printk was called from workqueue code). Once we find
> > out kthread is spinning on a lock, we stop printing, drop console_sem, and
> > let kthread continue printing. Since there are two printing kthreads, they
> > will pass printing between them and thus no CPU gets hogged by printing.
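
For readers skimming the thread, the handoff described above can be modelled
roughly as below. This is only a single-threaded user-space sketch, not code
from the patches: simulate_handoff(), struct handoff_stats and NR_PRINTERS
are hypothetical names, real locking (console_sem) is not modelled, and the
per-stint budget is simplified to exactly printk_offload characters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not kernel code. */
#define PRINTK_OFFLOAD_DEFAULT 1000 /* default quoted in the cover letter */
#define NR_PRINTERS 3 /* slot 0: CPU that called printk(); 1-2: the kthreads */

struct handoff_stats {
	size_t printed[NR_PRINTERS]; /* characters printed by each printer */
	size_t longest_stint;        /* most characters printed without a handoff */
	size_t handoffs;             /* how many times printing changed hands */
};

/*
 * Drain 'pending' characters. Whoever is printing stops once it has used
 * its budget of 'offload' characters and lets a kthread take over; the two
 * kthreads then alternate until the buffer is empty.
 */
struct handoff_stats simulate_handoff(size_t pending, size_t offload)
{
	struct handoff_stats st = {0};
	int cur = 0; /* printing starts on the CPU that called printk() */

	assert(offload > 0);

	while (pending > 0) {
		size_t stint = 0;

		/* Print until the buffer is empty or the budget is used up. */
		while (pending > 0 && stint < offload) {
			pending--;
			stint++;
		}

		st.printed[cur] += stint;
		if (stint > st.longest_stint)
			st.longest_stint = stint;

		if (pending > 0) {
			/* Models "drop console_sem, let the kthread continue". */
			cur = (cur == 1) ? 2 : 1;
			st.handoffs++;
		}
	}
	return st;
}
```

In this model, no matter how many characters are pending, no single printer
runs for more than the budget before handing over, which is the property the
series is after.
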
>
> I still hate your patchset ;)
>
> But nothing better suggests itself. I have a few review comments -
> please let's work through that stuff, get a fresh version out and we'll
> see how it goes.
>
> Is this patchset being used in the field? Perhaps in the suse kernel?
> If so, a mention of that in the changelog would help things along.

Yes, SUSE kernels contain these patches (well, previous versions of the
patch set...). So far they fix the issues reported by customers, and we
haven't observed any problems with those patches.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR