Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()

From: Will Deacon
Date: Tue Nov 17 2015 - 06:51:20 EST

Hi Linus,

On Mon, Nov 16, 2015 at 01:58:49PM -0800, Linus Torvalds wrote:
> On Mon, Nov 16, 2015 at 8:24 AM, Will Deacon <will.deacon@xxxxxxx> wrote:
> >
> > ... or we upgrade spin_unlock_wait to a LOCK operation, which might be
> > slightly cheaper than spin_lock()+spin_unlock().
> So traditionally the real concern has been the cacheline ping-pong
> part of spin_unlock_wait(). I think adding a memory barrier (that
> doesn't force any exclusive states, just ordering) to it is fine, but
> I don't think we want to necessarily have it have to get the cacheline
> into exclusive state.

The problem is, I don't think the memory-barrier buys you anything in
the context of Boqun's example. In fact, he already had smp_mb() either
side of the spin_unlock_wait() and its still broken on arm64 and ppc.

Paul is proposing adding a memory barrier after spin_lock() in the racing
thread, but I personally think people will forget to add that.

> Because if spin_unlock_wait() ends up having to get the spinlock
> cacheline (for example, by writing the same value back with a SC), I
> don't think spin_unlock_wait() will really be all that much cheaper
> than just getting the spinlock, and in that case we shouldn't play
> complicated ordering games.

It was the lock-fairness guarantees that I was concerned about. A
spin_lock could place you into a queue, so you're no longer waiting for
a single spin_unlock(), you're now waiting for *all* the spin_unlocks
by the CPUs preceding you in the queue.

> On another issue:
> I'm also looking at the ARM documentation for strx, and the
> _documentation_ says that it has no stronger ordering than a "store
> release", but I'm starting to wonder if that is actually true.
> Because I do end up thinking that it does have the same "control
> dependency" to all subsequent writes (but not reads). So reads after
> the SC can percolate up, but I think writes are restricted.
> Why? In order for the SC to be able to return success, the write
> itself may not have been actually done yet, but the cacheline for the
> write must have successfully be turned into exclusive ownership.
> Agreed?

If the LL/SC logic hangs off the coherency logic, then yes, but there
are other ways to build this (using a seperate "exclusive monitor" block)
and the architecture caters for this, too. See below.

> That means that by the time a SC returns success, no other CPU can see
> the old value of the spinlock any more. So by the time any subsequent
> stores in the locked region can be visible to any other CPU's, the
> locked value of the lock itself has to be visible too.
> Agreed?

No. A successful SC is *not* multi-copy atomic, but you're right to
point out that it provides more guarantees than a plain store. In
particular, a successful SC cannot return its success value until its
corresponding write has fixed its place in the coherence order for the
location that it is updating.

To be more concrete (simplified AArch64 asm, X1 and X2 hold addresses of
zero-initialised locations):

LDXR X0, [X1]
ADD X0, X0, #1
STXR X0, [X1] // Succeeds

LDR X0, [X1] // Reads 1
STR #1, [X2]

LDR X0, [X2] // Reads 1
LDR X0, [X1] // **Not required to read 1**


LDXR X0, [X1]
ADD X0, X0, #1
STXR X0, [X1] // Succeeds

LDR X0, [X1] // Reads 1
STR #1, [X2]

LDR X0, [X2] // Reads 1
STR #2, [X1] // Location at [X1] must be ordered {0->1->2}

We can also extend this example so that P2 instead does:

LDR X0, [X2] // Reads 1
LDXR X0, [X1]
ADD X0, X0, #1
STXR X0, [X1] // Succeeds; location at [x1] must be ordered {0->1->2}

i.e. the STXR cannot succeed until the coherence order has been resolved,
which also requires the LDXR to return the up-to-date value.

> So I think that in effect, when a spinlock is implemnted with LL/SC,
> the loads inside the locked region are only ordered wrt the acquire on
> the LL, but the stores can be considered ordered wrt the SC.
> No?

I initially fell into the same trap because a control dependency between
a load and a store is sufficient to create order (i.e. we don't speculate
writes). An SC is different, though, because the control dependency can
be resolved without the write being multi-copy atomic, whereas a read is
required to return its data (i.e. complete) before the control hazard can
be resolved.

I think that all of this means I should either:

(1) Update the arm64 spin_unlock_wait to use LDXR/STXR (perhaps with
acquire semantics?)

- or -

(2) Replace spin_unlock_wait with spin_lock; spin_unlock, my worries
about queuing notwithstanding.

The cacheline ping-pong in (1) can be mitigated somewhat by the use of
wfe, so it won't be any worse than a spin_lock().

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at