Re: Kernel Concurrency Sanitizer (KCSAN)

From: Dmitry Vyukov
Date: Fri Sep 20 2019 - 12:47:18 EST


On Fri, Sep 20, 2019 at 6:31 PM Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> On Fri, Sep 20, 2019 at 04:18:57PM +0200, Marco Elver wrote:
> > Hi all,
>
> Hi,
>
> > We would like to share a new data-race detector for the Linux kernel:
> > Kernel Concurrency Sanitizer (KCSAN) --
> > https://github.com/google/ktsan/wiki/KCSAN (Details:
> > https://github.com/google/ktsan/blob/kcsan/Documentation/dev-tools/kcsan.rst)
>
> Nice!
>
> BTW kcsan_atomic_next() is missing a stub definition in <linux/kcsan.h>
> when !CONFIG_KCSAN:
>
> https://github.com/google/ktsan/commit/a22a093a0f0d0b582c82cdbac4f133a3f61d207c#diff-19d7c475b4b92aab8ba440415ab786ec
>
> ... and I think the kcsan_{begin,end}_atomic() stubs need to be static
> inline too.
>
> It looks like this is easy enough to enable on arm64, with the only real
> special case being secondary_start_kernel() which we might want to
> refactor to allow some portions to be instrumented.
>
> I pushed the trivial patches I needed to get arm64 booting to my arm64/kcsan
> branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git arm64/kcsan
>
> We have some interesting splats at boot time in stop_machine, which
> don't seem to have been hit/fixed on x86 yet in the kcsan-with-fixes
> branch, e.g.
>
> [ 0.237939] ==================================================================
> [ 0.239431] BUG: KCSAN: data-race in multi_cpu_stop+0xa8/0x198 and set_state+0x80/0xb0
> [ 0.241189]
> [ 0.241606] write to 0xffff00001003bd00 of 4 bytes by task 24 on cpu 3:
> [ 0.243435] set_state+0x80/0xb0
> [ 0.244328] multi_cpu_stop+0x16c/0x198
> [ 0.245406] cpu_stopper_thread+0x170/0x298
> [ 0.246565] smpboot_thread_fn+0x40c/0x560
> [ 0.247696] kthread+0x1a8/0x1b0
> [ 0.248586] ret_from_fork+0x10/0x18
> [ 0.249589]
> [ 0.250006] read to 0xffff00001003bd00 of 4 bytes by task 14 on cpu 1:
> [ 0.251804] multi_cpu_stop+0xa8/0x198
> [ 0.252851] cpu_stopper_thread+0x170/0x298
> [ 0.254008] smpboot_thread_fn+0x40c/0x560
> [ 0.255135] kthread+0x1a8/0x1b0
> [ 0.256027] ret_from_fork+0x10/0x18
> [ 0.257036]
> [ 0.257449] Reported by Kernel Concurrency Sanitizer on:
> [ 0.258918] CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.3.0-00007-g67ab35a199f4-dirty #3
> [ 0.261241] Hardware name: linux,dummy-virt (DT)
> [ 0.262517] ==================================================================
>
> > To those of you who we mentioned at LPC that we're working on a
> > watchpoint-based KTSAN inspired by DataCollider [1], this is it (we
> > renamed it to KCSAN to avoid confusion with KTSAN).
> > [1] http://usenix.org/legacy/events/osdi10/tech/full_papers/Erickson.pdf
> >
> > In the coming weeks we're planning to:
> > * Set up a syzkaller instance.
> > * Share the dashboard so that you can see the races that are found.
> > * Attempt to send fixes for some races upstream (if you find that the
> > kcsan-with-fixes branch contains an important fix, please feel free to
> > point it out and we'll prioritize that).
> >
> > There are a few open questions:
> > * The big one: most of the reported races are due to unmarked
> > accesses; prioritization or pruning of races to focus initial efforts
> > to fix races might be required. Comments on how best to proceed are
> > welcome. We're aware that these are issues that have recently received
> > attention in the context of the LKMM
> > (https://lwn.net/Articles/793253/).
>
> I think the big risk here is drive-by "fixes" masking the warnings
> rather than fixing the actual issue. It's easy for people to suppress a
> warning with {READ,WRITE}_ONCE(), so they're liable to do that even the
> resulting race isn't benign.
>
> I don't have a clue how to prevent that, though.

I think this is mostly orthogonal problem. E.g. for some syzbot bugs I
see fixes that also try to simply "shut up" the immediate
manifestation with whatever means, e.g. sprinkling some slinlocks. So
(1) it's not unique to atomics, (2) presence of READ/WRITE_ONCE will
make the reader aware of the fact that this runs concurrently with
something else, and then they may ask themselves why this runs
concurrently with something when the object is supposed to be private
to the thread, and then maybe they re-fix it properly. Whereas if it's
completely unmarked, nobody will even notice that this code accesses
the object concurrently with other code. So even if READ/WRITE_ONCE
was a wrong fix, it's still better to have it rather than not.