Re: [RESEND PATCH v5] locking/pvqspinlock: Relax cmpxchg's to improve performance on some archs

From: Peter Zijlstra
Date: Wed Aug 09 2017 - 11:06:15 EST


On Wed, May 24, 2017 at 09:38:28AM -0400, Waiman Long wrote:

> @@ -361,6 +361,13 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> * observe its next->locked value and advance itself.
> *
> * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> + *
> + * The write to next->locked in arch_mcs_spin_unlock_contended()
> + * must be ordered before the read of pn->state in the cmpxchg()
> + * below for the code to work correctly. However, this is not
> + * guaranteed on all architectures when the cmpxchg() call fails.
> + * Both x86 and PPC can provide that guarantee, but other
> + * architectures not necessarily.
> */
> if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> return;
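
Spelling that comment out, the ordering in question is the classic
store-buffering shape. A hand-written litmus-style sketch (not taken
from the patch; x stands in for next->locked, y for pn->state, with
0/1/2 playing vcpu_running/vcpu_halted/vcpu_hashed):

	C pv-kick-node-sb

	(*
	 * P0 is pv_kick_node() right after arch_mcs_spin_unlock_contended(),
	 * P1 is the smp_store_mb() + ->locked check in pv_wait_node().
	 *)

	{}

	P0(int *x, int *y)
	{
		int r0;

		smp_store_release(x, 1);
		r0 = cmpxchg(y, 1, 2);
	}

	P1(int *x, int *y)
	{
		int r1;

		WRITE_ONCE(*y, 1);
		smp_mb();
		r1 = READ_ONCE(*x);
	}

	exists (0:r0=0 /\ 1:r1=0)

The exists clause is the case where both CPUs miss each other's store:
the cmpxchg() fails so pv_kick_node() bails early, while the waiter sees
->locked still clear and goes to sleep in pv_wait(). Per the comment
above that outcome must be forbidden, and whether it is hinges on the
failed cmpxchg() still ordering its load after the earlier store.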

Instead of documenting this, should we not fix it properly?

So what we want is to order:

	smp_store_release(&x, 1);
	cmpxchg(&y, 0, 1);

Such that the store to x is ordered before the load of y. Now, we
document cmpxchg() to have smp_mb() before and smp_mb() after (on
success). So per that definition, there would appear to be no way the
load of y can be reordered before the store to x.
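
And IIRC the generic glue that builds the fully ordered ops out of the
_relaxed ones places its barriers unconditionally around the op; roughly
(quoting from memory):

	#define __atomic_op_fence(op, args...)			\
	({							\
		typeof(op##_relaxed(args)) __ret;		\
		smp_mb__before_atomic();			\
		__ret = op##_relaxed(args);			\
		smp_mb__after_atomic();				\
		__ret;						\
	})

So an arch that relies on that wrapper gets a full barrier on both sides
of cmpxchg() even when it fails.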

Now, ARM64 for instance plays funny games; it does something along the
lines of:

	cmpxchg(ptr, old, new)
	{
		do {
			r = LL(ptr);
			if (r != old)
				return r; /* no barriers */
		} while (SC_release(ptr, new));
		smp_mb();
		return r;
	}

Thereby ordering things relative to the store to ptr, but the load can
very much escape. The thinking is that on success we must have observed
the latest value of ptr, but even in that case the load itself is not
ordered and could have happened earlier.

However, since we're guaranteed to observe the latest value of ptr (on
success), it doesn't matter if we reordered the load; there is no newer
value possible.

So heaps of tricky, but correct afaict. Will?


Of course, since we need that load to be ordered even in the case of a
failed cmpxchg(), we _should_ add an unconditional smp_mb() here, which
I understand you'd rather avoid. Even smp_mb__before_atomic() is no
help, because that's smp_mb() for both PPC and ARM64.
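
For concreteness, the unconditional-barrier option would look something
like this (an untested sketch, not a proper patch):

	/*
	 * Order the ->locked store in arch_mcs_spin_unlock_contended()
	 * against the ->state load below, even when the cmpxchg() fails.
	 */
	smp_mb();
	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
		return;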