Re: Kernel Concurrency Sanitizer (KCSAN)

From: Marco Elver
Date: Fri Sep 20 2019 - 13:51:18 EST


On Fri, 20 Sep 2019 at 18:47, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>
> On Fri, Sep 20, 2019 at 6:31 PM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> >
> > On Fri, Sep 20, 2019 at 04:18:57PM +0200, Marco Elver wrote:
> > > We would like to share a new data-race detector for the Linux kernel:
> > > Kernel Concurrency Sanitizer (KCSAN) --
> > > https://github.com/google/ktsan/wiki/KCSAN (Details:
> > > https://github.com/google/ktsan/blob/kcsan/Documentation/dev-tools/kcsan.rst)
> >
> > Nice!
> >
> > BTW kcsan_atomic_next() is missing a stub definition in <linux/kcsan.h>
> > when !CONFIG_KCSAN:
> >
> > https://github.com/google/ktsan/commit/a22a093a0f0d0b582c82cdbac4f133a3f61d207c#diff-19d7c475b4b92aab8ba440415ab786ec
> >
> > ... and I think the kcsan_{begin,end}_atomic() stubs need to be static
> > inline too.

Thanks for catching, fixed and pushed. Feel free to rebase your arm64 branch.

> > It looks like this is easy enough to enable on arm64, with the only real
> > special case being secondary_start_kernel() which we might want to
> > refactor to allow some portions to be instrumented.
> >
> > I pushed the trivial patches I needed to get arm64 booting to my arm64/kcsan
> > branch:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git arm64/kcsan

Cool, thanks for testing!

> > We have some interesting splats at boot time in stop_machine, which
> > don't seem to have been hit/fixed on x86 yet in the kcsan-with-fixes
> > branch, e.g.
> >
> > [ 0.237939] ==================================================================
> > [ 0.239431] BUG: KCSAN: data-race in multi_cpu_stop+0xa8/0x198 and set_state+0x80/0xb0
> > [ 0.241189]
> > [ 0.241606] write to 0xffff00001003bd00 of 4 bytes by task 24 on cpu 3:
> > [ 0.243435] set_state+0x80/0xb0
> > [ 0.244328] multi_cpu_stop+0x16c/0x198
> > [ 0.245406] cpu_stopper_thread+0x170/0x298
> > [ 0.246565] smpboot_thread_fn+0x40c/0x560
> > [ 0.247696] kthread+0x1a8/0x1b0
> > [ 0.248586] ret_from_fork+0x10/0x18
> > [ 0.249589]
> > [ 0.250006] read to 0xffff00001003bd00 of 4 bytes by task 14 on cpu 1:
> > [ 0.251804] multi_cpu_stop+0xa8/0x198
> > [ 0.252851] cpu_stopper_thread+0x170/0x298
> > [ 0.254008] smpboot_thread_fn+0x40c/0x560
> > [ 0.255135] kthread+0x1a8/0x1b0
> > [ 0.256027] ret_from_fork+0x10/0x18
> > [ 0.257036]
> > [ 0.257449] Reported by Kernel Concurrency Sanitizer on:
> > [ 0.258918] CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.3.0-00007-g67ab35a199f4-dirty #3
> > [ 0.261241] Hardware name: linux,dummy-virt (DT)
> > [ 0.262517] ==================================================================>

Thanks, the fixes in -with-fixes were ones I only encountered with
Syzkaller, where I disable KCSAN during boot. I've just added a fix
for this race and pushed to kcsan-with-fixes.

> > > To those of you who we mentioned at LPC that we're working on a
> > > watchpoint-based KTSAN inspired by DataCollider [1], this is it (we
> > > renamed it to KCSAN to avoid confusion with KTSAN).
> > > [1] http://usenix.org/legacy/events/osdi10/tech/full_papers/Erickson.pdf
> > >
> > > In the coming weeks we're planning to:
> > > * Set up a syzkaller instance.
> > > * Share the dashboard so that you can see the races that are found.
> > > * Attempt to send fixes for some races upstream (if you find that the
> > > kcsan-with-fixes branch contains an important fix, please feel free to
> > > point it out and we'll prioritize that).
> > >
> > > There are a few open questions:
> > > * The big one: most of the reported races are due to unmarked
> > > accesses; prioritization or pruning of races to focus initial efforts
> > > to fix races might be required. Comments on how best to proceed are
> > > welcome. We're aware that these are issues that have recently received
> > > attention in the context of the LKMM
> > > (https://lwn.net/Articles/793253/).
> >
> > I think the big risk here is drive-by "fixes" masking the warnings
> > rather than fixing the actual issue. It's easy for people to suppress a
> > warning with {READ,WRITE}_ONCE(), so they're liable to do that even the
> > resulting race isn't benign.
> >
> > I don't have a clue how to prevent that, though.
>
> I think this is mostly orthogonal problem. E.g. for some syzbot bugs I
> see fixes that also try to simply "shut up" the immediate
> manifestation with whatever means, e.g. sprinkling some slinlocks. So
> (1) it's not unique to atomics, (2) presence of READ/WRITE_ONCE will
> make the reader aware of the fact that this runs concurrently with
> something else, and then they may ask themselves why this runs
> concurrently with something when the object is supposed to be private
> to the thread, and then maybe they re-fix it properly. Whereas if it's
> completely unmarked, nobody will even notice that this code accesses
> the object concurrently with other code. So even if READ/WRITE_ONCE
> was a wrong fix, it's still better to have it rather than not.