Re: Process for severe early stable bugs?

From: Greg KH
Date: Mon Dec 10 2018 - 04:51:08 EST


On Sun, Dec 09, 2018 at 11:44:19AM -0500, Theodore Y. Ts'o wrote:
> On Sun, Dec 09, 2018 at 12:30:39PM +0100, Greg KH wrote:
> > > P.P.P.S. If I were king, I'd be asking for a huge number of kunit
> > > tests for block-mq to be developed, and then running them under a
> > > Thread Sanitizer.
> >
> > Isn't that what xfstests and fio are? Aren't we running this all the time and
> > reporting those issues? How did this bug not show up on those tests, is
> > it just because they didn't run long enough?
> >
> > Because of those test suites, I was thinking that the block and
> > filesystem paths were one of the more well-tested things we had at the
> > moment, is this not true?
>
> I'm pretty confident about the file system paths, and the "happy
> paths" for the block layer.
>
> But with Kernel Bugzilla #201685, despite huge amounts of testing both
> before and after 4.19-rc1, nothing picked it up. It turned out to be very
> configuration specific, *and* only happened when you were under heavy
> memory pressure and/or I/O pressure.
>
> I'm starting to try to use blktests, but it's not as mature as
> xfstests. It has portability issues, as it assumes a much newer
> userspace. So I can't even run it under some environments at all.
> The test coverage just isn't as broad. Compare:
>
> ext4/4k: 441 tests, 1 failures, 42 skipped, 4387 seconds
> Failures: generic/388
>
> Versus:
>
> Run: block/001 block/002 block/003 block/004 block/005 block/006
> block/009 block/010 block/012 block/013 block/014 block/015
> block/016 block/017 block/018 block/020 block/021 block/023
> block/024 loop/001 loop/002 loop/003 loop/004 loop/005 loop/006
> nvme/002 nvme/003 nvme/004 nvme/006 nvme/007 nvme/008 nvme/009
> nvme/010 nvme/011 nvme/012 nvme/013 nvme/014 nvme/015 nvme/016
> nvme/017 nvme/019 nvme/020 nvme/021 nvme/022 nvme/023 nvme/024
> nvme/025 nvme/026 nvme/027 nvme/028 scsi/001 scsi/002 scsi/003
> scsi/004 scsi/005 scsi/006 srp/001 srp/002 srp/003 srp/004
> srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 srp/013
> Failures: block/017 block/024 nvme/002 nvme/003 nvme/008 nvme/009
> nvme/010 nvme/011 nvme/012 nvme/013 nvme/014 nvme/015 nvme/016
> nvme/019 nvme/020 nvme/021 nvme/022 nvme/023 nvme/024 nvme/025
> nvme/026 nvme/027 nvme/028 scsi/006 srp/001 srp/002 srp/003 srp/004
> srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 srp/013
> Failed 37 of 69 tests
>
> (Most of the failures are test portability issues that I still need to
> work through, not real failures. But just look at the number of
> tests....)

So you are saying quantity rules over quality? :)

It's really hard to judge this, given that xfstests are testing a whole
range of other things (POSIX compliance and stressing the vfs api),
while blktests are there to stress the block i/o api/interface.

So it would be best to run both, as we know xfstests also hits the block
layer...

thanks,

greg k-h