Re: Kernel Concurrency Sanitizer (KCSAN)
From: Joel Fernandes
Date: Fri Oct 04 2019 - 14:08:54 EST
On Fri, Oct 04, 2019 at 07:01:37PM +0200, Dmitry Vyukov wrote:
> On Fri, Oct 4, 2019 at 6:57 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> >
> > On Fri, Oct 04, 2019 at 06:52:49PM +0200, Dmitry Vyukov wrote:
> > > On Fri, Oct 4, 2019 at 6:49 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Oct 02, 2019 at 09:51:58PM +0200, Marco Elver wrote:
> > > > > Hi Joel,
> > > > >
> > > > > On Tue, 1 Oct 2019 at 23:19, Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Fri, Sep 20, 2019 at 04:18:57PM +0200, Marco Elver wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > We would like to share a new data-race detector for the Linux kernel:
> > > > > > > Kernel Concurrency Sanitizer (KCSAN) --
> > > > > > > https://github.com/google/ktsan/wiki/KCSAN (Details:
> > > > > > > https://github.com/google/ktsan/blob/kcsan/Documentation/dev-tools/kcsan.rst)
> > > > > > >
> > > > > > > To those of you who we mentioned at LPC that we're working on a
> > > > > > > watchpoint-based KTSAN inspired by DataCollider [1], this is it (we
> > > > > > > renamed it to KCSAN to avoid confusion with KTSAN).
> > > > > > > [1] http://usenix.org/legacy/events/osdi10/tech/full_papers/Erickson.pdf
> > > > > > >
> > > > > > > In the coming weeks we're planning to:
> > > > > > > * Set up a syzkaller instance.
> > > > > > > * Share the dashboard so that you can see the races that are found.
> > > > > > > * Attempt to send fixes for some races upstream (if you find that the
> > > > > > > kcsan-with-fixes branch contains an important fix, please feel free to
> > > > > > > point it out and we'll prioritize that).
> > > > > > >
> > > > > > > There are a few open questions:
> > > > > > > * The big one: most of the reported races are due to unmarked
> > > > > > > accesses; prioritization or pruning of races to focus initial efforts
> > > > > > > to fix races might be required. Comments on how best to proceed are
> > > > > > > welcome. We're aware that these are issues that have recently received
> > > > > > > attention in the context of the LKMM
> > > > > > > (https://lwn.net/Articles/793253/).
> > > > > > > * How/when to upstream KCSAN?
> > > > > >
> > > > > > Looks exciting. I think based on our discussion at LPC, you mentioned
> > > > > > one way of pruning is if the compiler generated different code with _ONCE
> > > > > > annotations than what would have otherwise been generated. Is that still on
> > > > > > the table, for the purposing of pruning the reports?
> > > > >
> > > > > This might be interesting at first, but it's not entirely clear how
> > > > > feasible it is. It's also dangerous, because the real issue would be
> > > > > ignored. It may be that one compiler version on a particular
> > > > > architecture generates the same code, but any change in compiler or
> > > > > architecture and this would no longer be true. Let me know if you have
> > > > > any more ideas.
> > > >
> > > > My thought was this technique of looking at compiler generated code can be
> > > > used for prioritization of the reports. Have you tested it though? I think
> > > > without testing such technique, we could not know how much of benefit (or
> > > > lack thereof) there is to the issue.
> > > >
> > > > In fact, IIRC, the compiler generating different code with _ONCE annotation
> > > > can be given as justification for patches doing such conversions.
> > >
> > >
> > > We also should not forget about "missed mutex" races (e.g. unprotected
> > > radix tree), which are much worse and higher priority than a missed
> > > atomic annotation. If we look at codegen we may discard most of them
> > > as non important.
> >
> > Sure. I was not asking to look at codegen as the only signal. But to use the
> > signal for whatever it is worth.
>
> But then we need other, stronger signals. We don't have any.
> So if the codegen is the only one and it says "this is not important",
> then we conclude "this is not important".

I didn't mean for codegen to say "this is not important", but rather "this IS
important". And for the other ones, "this may not be important, or it may
be very important, I don't know".

Why do you say a missed atomic annotation is lower priority? A bug is a bug,
and ought to be fixed IMHO. Arguably, a missing lock acquisition can be detected
more easily, through lockdep and its assertions, than a missing _ONCE
annotation. The latter has no easy way of being detected at runtime and
can cause failures in mysterious ways.

I think you can divide the problem up. One set of bugs is data races where the
codegen changes, and those are "important" for that reason. The other set is
races where it is less clear whether they are important or not -- until you have a
better way of providing a signal for categorizing those.
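
A minimal user-space sketch of the kind of codegen difference discussed above
(not kernel code). The READ_ONCE()/WRITE_ONCE() macros are simplified stand-ins
for the kernel's, and flag, spin_plain() and spin_marked() are hypothetical
names used only for illustration. Whether the two functions actually compile to
different code depends on the compiler, version, flags and architecture, which
is the point made earlier in the thread:

```c
/*
 * Simplified stand-ins for the kernel macros (assumption: the real ones
 * do more work). __typeof__ is a GCC/Clang extension.
 */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical shared variable that another thread could modify. */
int flag;

/*
 * Plain (unmarked) access: the compiler is allowed to load 'flag' once
 * and reuse the value for every iteration of the loop.
 */
int spin_plain(int max_spins)
{
	int spins = 0;

	while (!flag && spins < max_spins)
		spins++;
	return spins;
}

/*
 * Marked access: the volatile cast forces a fresh load of 'flag' on
 * every iteration, so the generated code may differ from spin_plain().
 */
int spin_marked(int max_spins)
{
	int spins = 0;

	while (!READ_ONCE(flag) && spins < max_spins)
		spins++;
	return spins;
}
```

One way to use this as a signal is to build the file with something like
"gcc -O2 -S" and compare the assembly of the two functions. Identical output
for one compiler and architecture says nothing about another, so a difference
can flag a race as important, but the absence of one cannot clear it.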

Did I miss something?

thanks,

 - Joel