Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

From: Andrea Parri
Date: Thu Sep 27 2018 - 03:48:01 EST


On Thu, Sep 27, 2018 at 09:17:47AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 26, 2018 at 10:52:08PM +0200, Andrea Parri wrote:
> > On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:
> > > On x86 we cannot do fetch_or with a single instruction and end up
> > > using a cmpxchg loop, which reduces determinism. Replace the fetch_or
> > > with a very tricky composite xchg8 + load.
> > >
> > > The basic idea is that we use xchg8 to test-and-set the pending bit
> > > (when it is a byte) and then a load to fetch the whole word. Using
> > > two instructions of course opens a window we previously did not have.
> > > In particular the ordering between pending and tail is of interest,
> > > because that is where the split happens.
> > >
> > > The claim is that if we order them, it all works out just fine. There
> > > are two specific cases where the pending,tail state changes:
> > >
> > > - when the 3rd lock(er) comes in and finds pending set, it'll queue
> > >   and set tail; since we set tail while pending is set, the ordering
> > >   of the split is not important (and not fundamentally different from
> > >   fetch_or). [*]
> > >
> > > - when the last queued lock holder acquires the lock (uncontended),
> > >   we clear the tail and set the lock byte. By first setting the
> > >   pending bit this cmpxchg will fail and the later load must then
> > >   see the remaining tail.
> > >
> > > Another interesting scenario is where there are only 2 threads:
> > >
> > >   lock := (0,0,0)
> > >
> > >   CPU 0                     CPU 1
> > >
> > >   lock()                    lock()
> > >     trylock(-> 0,0,1)         trylock() /* fail */
> > >     return;                   xchg_relaxed(pending, 1) (-> 0,1,1)
> > >                               mb()
> > >                               val = smp_load_acquire(*lock);
> > >
> > > Where, without the mb() the load would've been allowed to return 0 for
> > > the locked byte.
> >
> > If this were true, we would have a violation of "coherence":
>
> The thing is, this is mixed size, see:

The accesses to ->val are not, and those certainly have to meet the
"coherence" constraint (no matter the store to ->pending).


>
> https://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf
>
> If I remember things correctly (I've not reread that paper recently) it
> is allowed for:
>
> old = xchg(pending,1);
> val = smp_load_acquire(*lock);
>
> to be re-ordered like:
>
> val = smp_load_acquire(*lock);
> old = xchg(pending, 1);
>
> with the exception that it will fwd the pending byte into the later
> load, so we get:
>
> val = (val & ~_Q_PENDING_MASK) | (old << _Q_PENDING_OFFSET);
>
> for 'free'.
>
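
For concreteness, here is a minimal sketch of the value composition described above, in plain C. It assumes the usual qspinlock layout for small NR_CPUS (locked in bits 0-7, pending in bits 8-15, tail in bits 16-31). The names Q_PENDING_OFFSET, Q_PENDING_MASK and compose_val() are made up for illustration and are not taken from the patch. The intent is to keep every byte except pending from the loaded word and take pending from the xchg result:

```c
#include <stdint.h>

/* Assumed layout: bits 0-7 locked, bits 8-15 pending, bits 16-31 tail. */
#define Q_PENDING_OFFSET 8
#define Q_PENDING_MASK   (0xffu << Q_PENDING_OFFSET)

/*
 * Hypothetical helper: combine the word returned by the load ("val")
 * with the old pending byte returned by the xchg ("old").
 */
static inline uint32_t compose_val(uint32_t val, uint8_t old)
{
	/* Drop the pending byte seen by the load, substitute the xchg result. */
	return (val & ~Q_PENDING_MASK) | ((uint32_t)old << Q_PENDING_OFFSET);
}
```

So even if the load observes the pending byte this CPU has just stored, the composed value still reports the pending state from before the xchg, which is what a fetch_or would have returned.
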
> LKMM in particular does _NOT_ deal with mixed sized atomics _at_all_.

True, but it is nothing conceptually new to deal with: there're Cat
models that handle mixed-size accesses, just give it time.

Andrea


>
> With the addition of smp_mb__after_atomic(), we disallow the load to be
> done prior to the xchg(). It might still fwd the more recent pending
> byte from its store buffer, but at least the other bytes must not be
> earlier.
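
The sequence under discussion (byte xchg, full barrier, word load) can be sketched in userspace roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: it uses the GCC/Clang __atomic builtins, assumes a little-endian layout with pending in byte 1 of the word, and relies on the implementation tolerating mixed-size atomic accesses to the same word (which ISO C does not define). struct sketch_lock and sketch_fetch_set_pending_acquire() are hypothetical names, and the seq_cst fence is a stand-in for smp_mb__after_atomic().

```c
#include <stdint.h>

#define Q_PENDING_OFFSET 8
#define Q_PENDING_MASK   (0xffu << Q_PENDING_OFFSET)

/*
 * Hypothetical stand-in for struct qspinlock. Little-endian layout assumed:
 * byte 0 = locked, byte 1 = pending, bytes 2-3 = tail.
 */
struct sketch_lock {
	uint32_t val;
};

static inline uint32_t sketch_fetch_set_pending_acquire(struct sketch_lock *lock)
{
	uint8_t *pending = (uint8_t *)&lock->val + 1;
	uint32_t val;

	/* Byte-sized test-and-set of pending; keep the old pending value. */
	val = (uint32_t)__atomic_exchange_n(pending, 1, __ATOMIC_RELAXED)
		<< Q_PENDING_OFFSET;

	/*
	 * Stand-in for smp_mb__after_atomic(): the word load below must not
	 * be satisfied before the exchange above.
	 */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);

	/* Take locked and tail from the word load; pending comes from the xchg. */
	val |= __atomic_load_n(&lock->val, __ATOMIC_ACQUIRE) & ~Q_PENDING_MASK;

	return val;
}
```

Called on a lock in state (0,0,1) this returns a value with the locked byte set and pending clear, and leaves the lock in state (0,1,1), matching the CPU 1 column of the two-thread scenario.
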