Re: Kernel Concurrency Sanitizer (KCSAN)

From: Dmitry Vyukov
Date: Fri Oct 04 2019 - 14:28:57 EST


On Fri, Oct 4, 2019 at 8:08 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > We would like to share a new data-race detector for the Linux kernel:
> > > > > > > > Kernel Concurrency Sanitizer (KCSAN) --
> > > > > > > > https://github.com/google/ktsan/wiki/KCSAN (Details:
> > > > > > > > https://github.com/google/ktsan/blob/kcsan/Documentation/dev-tools/kcsan.rst)
> > > > > > > >
> > > > > > > > To those of you who we mentioned at LPC that we're working on a
> > > > > > > > watchpoint-based KTSAN inspired by DataCollider [1], this is it (we
> > > > > > > > renamed it to KCSAN to avoid confusion with KTSAN).
> > > > > > > > [1] http://usenix.org/legacy/events/osdi10/tech/full_papers/Erickson.pdf
> > > > > > > >
> > > > > > > > In the coming weeks we're planning to:
> > > > > > > > * Set up a syzkaller instance.
> > > > > > > > * Share the dashboard so that you can see the races that are found.
> > > > > > > > * Attempt to send fixes for some races upstream (if you find that the
> > > > > > > > kcsan-with-fixes branch contains an important fix, please feel free to
> > > > > > > > point it out and we'll prioritize that).
> > > > > > > >
> > > > > > > > There are a few open questions:
> > > > > > > > * The big one: most of the reported races are due to unmarked
> > > > > > > > accesses; prioritization or pruning of races to focus initial efforts
> > > > > > > > to fix races might be required. Comments on how best to proceed are
> > > > > > > > welcome. We're aware that these are issues that have recently received
> > > > > > > > attention in the context of the LKMM
> > > > > > > > (https://lwn.net/Articles/793253/).
> > > > > > > > * How/when to upstream KCSAN?
> > > > > > >
> > > > > > > Looks exciting. I think based on our discussion at LPC, you mentioned
> > > > > > > one way of pruning is if the compiler generated different code with _ONCE
> > > > > > > annotations than what would have otherwise been generated. Is that still on
> > > > > > > > the table, for the purpose of pruning the reports?
> > > > > >
> > > > > > This might be interesting at first, but it's not entirely clear how
> > > > > > feasible it is. It's also dangerous, because the real issue would be
> > > > > > ignored. It may be that one compiler version on a particular
> > > > > > architecture generates the same code, but any change in compiler or
> > > > > > architecture and this would no longer be true. Let me know if you have
> > > > > > any more ideas.
> > > > >
> > > > > My thought was this technique of looking at compiler generated code can be
> > > > > used for prioritization of the reports. Have you tested it though? I think
> > > > > > without testing such a technique, we cannot know how much benefit (or
> > > > > > lack thereof) it brings.
> > > > >
> > > > > In fact, IIRC, the compiler generating different code with _ONCE annotation
> > > > > can be given as justification for patches doing such conversions.
> > > >
> > > >
> > > > We also should not forget about "missed mutex" races (e.g. unprotected
> > > > radix tree), which are much worse and higher priority than a missed
> > > > atomic annotation. If we look at codegen we may discard most of them
> > > > as non important.
> > >
> > > Sure. I was not asking to look at codegen as the only signal. But to use the
> > > signal for whatever it is worth.
> >
> > But then we need other, stronger signals. We don't have any.
> > So if the codegen is the only one and it says "this is not important",
> > then we conclude "this is not important".
>
> I didn't mean for codegen to say "this is not important", but rather "this IS
> important". And for the other ones, "this may not be important, or it may
> be very important, I don't know".
>
> Why do you say a missed atomic annotation is lower priority? A bug is a bug,

You started talking about prioritization ;)

> and ought to be fixed IMHO. Arguably missing lock acquisition can be detected
> more easily due to lockdep assertions and using lockdep, than missing _ONCE
> annotations. The latter has no way of being detected at runtime easily and
> can be causing failures in mysterious ways.
>
> I think you can divide the problem up.. One set of bugs that are because of
> codegen changes and data races and are "important" for that reason. Another
> one that is less clear whether they are important or not -- until you have a
> better way of providing a signal for categorizing those.
>
> Did I miss something?

We have:
1. missed annotation with changing codegen.
2. missed annotation with non-changing codegen.
3. missed mutex with changing codegen.
4. missed mutex with non-changing codegen.

One can arguably say that 2 is less important than 1. But then both 3
and 4 are not low priority under any circumstances. And we don't have
any means to distinguish 1/2 from 3/4.
In this situation I don't see how "changing codegen" vs "non-changing
codegen" gives us any useful signal.
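
To make the four cases concrete, here is a small sketch in C of why the
codegen signal alone cannot drive prioritization. It only illustrates the
argument above and is not anything KCSAN does: the enum names, the
two-level priority, and the choice to rate case 1 as high priority are all
assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical root causes; a report does not tell us which one applies. */
enum root_cause { MISSED_ANNOTATION, MISSED_MUTEX };

enum priority { PRIO_LOW, PRIO_HIGH };

/* Priority we would assign if the root cause were known (assumed rating). */
static enum priority ideal_priority(enum root_cause cause, bool codegen_changes)
{
	if (cause == MISSED_MUTEX)
		return PRIO_HIGH;			/* cases 3 and 4 */
	return codegen_changes ? PRIO_HIGH : PRIO_LOW;	/* cases 1 and 2 */
}

/* Priority computable from the codegen signal alone. */
static enum priority codegen_only_priority(bool codegen_changes)
{
	return codegen_changes ? PRIO_HIGH : PRIO_LOW;
}

/* Count the cases that the codegen-only heuristic rates too low. */
static int count_underrated(void)
{
	int n = 0;

	for (int cause = MISSED_ANNOTATION; cause <= MISSED_MUTEX; cause++) {
		for (int changes = 0; changes <= 1; changes++) {
			bool c = changes != 0;

			if (codegen_only_priority(c) <
			    ideal_priority((enum root_cause)cause, c))
				n++;
		}
	}
	return n;
}
```

Under these assumptions the only mismatch is case 4 (missed mutex with
non-changing codegen), which is exactly the kind of report that must not
be rated low.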

Assuming we have some signal for lower priority, the only useful way
of using this signal that I see is throwing lower priority bugs away
automatically for now (not reporting on syzbot). Because if we do
report all bugs and humans need to look at all of them anyway, this
signal is not too useful. If I am already spending time on a report, I
might as well prioritize it myself, far more precisely than any
automatic scheme.

If we are not reporting lower priority bugs, we cannot afford to
classify "missed mutexes" as lower priority.
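
For readers unfamiliar with what a "missed annotation" (cases 1 and 2)
looks like, the sketch below uses userspace approximations of the kernel's
READ_ONCE()/WRITE_ONCE(). The _SKETCH macros are simplified stand-ins
written for this example (the kernel's real definitions are more
involved), and stop_flag and the helper functions are hypothetical. It
relies on the GCC/Clang __typeof__ extension.

```c
/* Simplified stand-ins for illustration, NOT the kernel's definitions. */
#define READ_ONCE_SKETCH(x)	(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical flag that another thread would set concurrently. */
static int stop_flag;

/*
 * Unmarked access: the compiler is allowed to load stop_flag once and
 * reuse the value, so a concurrent store might never be observed.
 */
static int spin_plain(int limit)
{
	int n = 0;

	while (!stop_flag && n < limit)
		n++;
	return n;
}

/* Marked access: the volatile cast forces a fresh load on every iteration. */
static int spin_marked(int limit)
{
	int n = 0;

	while (!READ_ONCE_SKETCH(stop_flag) && n < limit)
		n++;
	return n;
}

/* Marked store paired with the marked load above. */
static void request_stop(void)
{
	WRITE_ONCE_SKETCH(stop_flag, 1);
}
```

Whether spin_plain() and spin_marked() compile to different code depends
on the compiler, its version, and the architecture, which is why identical
codegen today says little about the race itself.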