Re: [PATCH] locking/osq_lock: fix a data race in osq_wait_next

From: Peter Zijlstra
Date: Tue Jan 28 2020 - 11:52:35 EST


On Tue, Jan 28, 2020 at 12:46:26PM +0100, Marco Elver wrote:
> On Tue, 28 Jan 2020 at 04:11, Qian Cai <cai@xxxxxx> wrote:
> >
> > > On Jan 23, 2020, at 4:39 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, Jan 22, 2020 at 06:54:43PM -0500, Qian Cai wrote:
> > >> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > >> index 1f7734949ac8..832e87966dcf 100644
> > >> --- a/kernel/locking/osq_lock.c
> > >> +++ b/kernel/locking/osq_lock.c
> > >> @@ -75,7 +75,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> > >> * wait for either @lock to point to us, through its Step-B, or
> > >> * wait for a new @node->next from its Step-C.
> > >> */
> > >> - if (node->next) {
> > >> + if (READ_ONCE(node->next)) {
> > >> next = xchg(&node->next, NULL);
> > >> if (next)
> > >> break;
> > >
> > > This could possibly trigger the warning, but is a false positive. The
> > > above doesn't fix anything in that even if that load is shattered the
> > > code will function correctly -- it checks for any !0 value, any byte
> > > composite that is !0 is sufficient.
> > >
> > > This is in fact something KCSAN compiler infrastructure could deduce.
>
> Not in the general case. As far as I can tell, this if-statement is
> purely optional and an optimization to avoid false sharing. This is
> specific knowledge about the logic that (without conveying more
> details about the logic) the tool couldn't safely deduce. Consider the
> case:
>
> T0:
> if ( (x = READ_ONCE(ptr)) ) use_ptr_value(*x);
>
> T1:
> WRITE_ONCE(ptr, valid_ptr);
>
> Here, unlike the case above, reading ptr without READ_ONCE can clearly
> be dangerous.

There is a very big difference here though. In the osq case the result
of the load is only every compared to 0, after which the value is
discarded. While in your example you let the variable escape and use it
again later.

I'm claiming that in the first case, the only thing that's ever done
with a racy load is comparing against 0, there is no possible bad
outcome ever. While obviously if you let the load escape, or do anything
other than compare against 0, there is.