Re: [PATCH] locking/osq_lock: fix a data race in osq_wait_next
From: Marco Elver
Date: Tue Jan 28 2020 - 06:46:41 EST
On Tue, 28 Jan 2020 at 04:11, Qian Cai <cai@xxxxxx> wrote:
>
> > On Jan 23, 2020, at 4:39 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Wed, Jan 22, 2020 at 06:54:43PM -0500, Qian Cai wrote:
> >> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> >> index 1f7734949ac8..832e87966dcf 100644
> >> --- a/kernel/locking/osq_lock.c
> >> +++ b/kernel/locking/osq_lock.c
> >> @@ -75,7 +75,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> >> * wait for either @lock to point to us, through its Step-B, or
> >> * wait for a new @node->next from its Step-C.
> >> */
> >> - if (node->next) {
> >> + if (READ_ONCE(node->next)) {
> >> next = xchg(&node->next, NULL);
> >> if (next)
> >> break;
> >
> > This could possibly trigger the warning, but is a false positive. The
> > above doesn't fix anything in that even if that load is shattered the
> > code will function correctly -- it checks for any !0 value, any byte
> > composite that is !0 is sufficient.
> >
> > This is in fact something KCSAN compiler infrastructure could deduce.
Not in the general case. As far as I can tell, this if-statement is
purely optional and an optimization to avoid false sharing. This is
specific knowledge about the logic that (without conveying more
details about the logic) the tool couldn't safely deduce. Consider the
case:
T0:
if ( (x = READ_ONCE(ptr)) ) use_ptr_value(*x);
T1:
WRITE_ONCE(ptr, valid_ptr);
Here, unlike the case above, reading ptr without READ_ONCE can clearly
be dangerous.
The false sharing scenario came up before, and maybe it's worth
telling the tool about the logic. In fact, the 'data_race()' macro is
perfectly well suited to do this.
>
> Marco, any thought on improving KCSAN for this to reduce the false
> positives?
Define 'false positive'.
>From what I can tell, all 'false positives' that have come up are data
races where the consequences on the behaviour of the code is
inconsequential. In other words, all of them would require
understanding of the intended logic of the code, and understanding if
the worst possible outcome of a data race changes the behaviour of the
code in such a way that we may end up with an erroneously behaving
system.
As I have said before, KCSAN (or any data race detector) by definition
only works at the language level. Any semantic analysis, beyond simple
rules (such as ignore same-value stores) and annotations, is simply
impossible since the tool can't know about the logic that the
programmer intended.
That being said, if there are simple rules (like ignore same-value
stores) or other minimal annotations that can help reduce such 'false
positives', more than happy to add them.
Qian: firstly I suggest you try
CONFIG_KCSAN_REPORT_ONCE_IN_MS=1000000000 as mentioned before so your
system doesn't get spammed, considering you do not use the default
config but want to use all debugging tools at once which seems to
trigger certain data races more than usual.
Secondly, what are your expectations? If you expect the situation to
be perfect tomorrow, you'll be disappointed. This is inherent, given
the problem we face (safe concurrency). Consider the various parts to
this story: concurrent kernel code, the LKMM, people's preferences and
opinions, and KCSAN (which is late to the party). All of them are
still evolving, hopefully together. At least that's my expectation.
What to do about osq_lock here? If people agree that no further
annotations are wanted, and the reasoning above concludes there are no
bugs, we can blacklist the file. That would, however, miss new data
races in future.
Thanks,
-- Marco