Re: [syzbot] KCSAN: data-race in start_this_handle / start_this_handle

From: Paul E. McKenney
Date: Fri Mar 19 2021 - 13:24:22 EST


On Fri, Mar 19, 2021 at 11:15:42PM +0900, Tetsuo Handa wrote:
> On 2021/03/12 0:54, Marco Elver wrote:
> >> But the more we could have the compiler automatically figure out
> >> things without needing an explicit tag, it would seem to me that this
> >> would be better, since manual tagging is going to be more error-prone.
> >
> > What you're alluding to here would go much further than a data race
> > detector ("data race" is still just defined by the memory model). The
> > wish that there was a static analysis tool that would automatically
> > understand the "concurrency semantics as intended by the developer" is
> > something that'd be nice to have, but just doesn't seem realistic.
> > Because how can a tool tell what the developer intended, without input
> > from that developer?
>
> Input from developers is very important for not only compilers and tools
> but also allowing bug-explorers to understand what is happening.
> ext4 currently has
>
> possible deadlock in start_this_handle (2)
> https://syzkaller.appspot.com/bug?id=38c060d5757cbc13fdffd46e80557c645fbe79ba
>
> which even maintainers cannot understand what is happening.
> How can bug-explorers know implicit logic which maintainers believe safe and correct?
> It is possible that some oversight in implicit logic is the cause of
> "possible deadlock in start_this_handle (2)".
> Making implicit assumptions clear helps understanding.

Just to be clear, the above diagnostic is from lockdep rather than KCSAN.

According to the sample crash result, different code paths acquire
jdb2_handle and the __fs_reclaim_map in different orders. It looks
to me that __fs_reclaim_map isn't really a lock, but rather a mode
indicator. If so, lockdep should set it up accordingly, perhaps
in a manner similar to rcu_lock_map.

> Will "KCSAN: data-race in start_this_handle / start_this_handle" be addressed by marking?
> syzbot is already waiting for
> "KCSAN: data-race in jbd2_journal_dirty_metadata / jbd2_journal_dirty_metadata" at
> https://syzkaller.appspot.com/bug?id=5eb10023f53097f003e72c6a7c1a6f14b7c22929 .

The first thing is to work out what the code should be doing. What KCSAN
is saying is that a variable is being locklessly updated. Is it really
OK for that variable to be locklessly updated? If not, a larger fix
is required.

For more information, please see Marco's LWN series:
https://lwn.net/Articles/816850/ and https://lwn.net/Articles/816854/

Alternatively, you can refer to the documentation being proposed for
the Linux kernel tree:

https://lore.kernel.org/lkml/20210304004543.25364-3-paulmck@xxxxxxxxxx/

> > If there's worry marking accesses is error-prone, then that might be a
> > signal that the concurrency design is too complex (or the developer
> > hasn't considered all cases).
> >
> > For that reason, we need to mark accesses to tell the compiler and
> > tooling where to expect concurrency, so that 1) the compiler generates
> > correct code, and 2) tooling such as KCSAN can double-check what the
> > developer intended is actually what's happening.
>
> and 3) bug-explorers can understand what the developers are assuming/missing.

If the above information doesn't help the bug explorers, please let me
know.

Thanx, Paul