Re: Kernel Concurrency Sanitizer (KCSAN)

From: Will Deacon
Date: Fri Sep 20 2019 - 11:54:29 EST


Hi Marco,

On Fri, Sep 20, 2019 at 04:18:57PM +0200, Marco Elver wrote:
> We would like to share a new data-race detector for the Linux kernel:
> Kernel Concurrency Sanitizer (KCSAN) --
> https://github.com/google/ktsan/wiki/KCSAN (Details:
> https://github.com/google/ktsan/blob/kcsan/Documentation/dev-tools/kcsan.rst)
>
> To those of you who we mentioned at LPC that we're working on a
> watchpoint-based KTSAN inspired by DataCollider [1], this is it (we
> renamed it to KCSAN to avoid confusion with KTSAN).
> [1] http://usenix.org/legacy/events/osdi10/tech/full_papers/Erickson.pdf

Oh, spiffy!

> In the coming weeks we're planning to:
> * Set up a syzkaller instance.
> * Share the dashboard so that you can see the races that are found.
> * Attempt to send fixes for some races upstream (if you find that the
> kcsan-with-fixes branch contains an important fix, please feel free to
> point it out and we'll prioritize that).

Curious: do you take into account things like alignment and/or access size
when looking at READ_ONCE/WRITE_ONCE? Perhaps you could initially prune
naturally aligned accesses for which __native_word() is true?

> There are a few open questions:
> * The big one: most of the reported races are due to unmarked
> accesses; prioritization or pruning of races to focus initial efforts
> to fix races might be required. Comments on how best to proceed are
> welcome. We're aware that these are issues that have recently received
> attention in the context of the LKMM
> (https://lwn.net/Articles/793253/).

This one is tricky. What I think we need to avoid is an onslaught of
patches adding READ_ONCE/WRITE_ONCE without a concrete analysis of the
code being modified. My worry is that Joe Developer is eager to get their
first patch into the kernel, so runs this tool and starts spamming
maintainers with these things to the point that they start ignoring KCSAN
reports altogether because of the time they take up.

I suppose one thing we could do is to require each new READ_ONCE/WRITE_ONCE
to have a comment describing the racy access, a bit like we do for memory
barriers. Another possibility would be to use atomic_t more widely if
there is genuine concurrency involved.

> * How/when to upstream KCSAN?

Start by posting the patches :)

Will