Re: [PATCH v2] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

From: Paul E. McKenney
Date: Tue Jul 17 2018 - 15:36:13 EST


On Tue, Jul 17, 2018 at 11:44:23AM -0700, Linus Torvalds wrote:
> On Tue, Jul 17, 2018 at 11:31 AM Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > The isync provides ordering roughly similar to lwsync, but nowhere near
> > as strong as sync, and it is sync that would be needed to cause lock
> > acquisition to provide full ordering.
>
> That's only true when looking at isync in isolation.
>
> Read the part I quoted. The AIX documentation implies that the
> *sequence* of load-compare-conditional branch-isync is a memory
> barrier, even if isync on its own is not.

You are referring to this URL, correct?

https://www.ibm.com/developerworks/systems/articles/powerpc.html

If so, the "this ordering property" refers to the ordering needed to
ensure that the later accesses happen after the "load global flag",
and lwsync suffices for that. (As does the branch-isync, as you say.)
The sync instruction provides much stronger ordering due to the fact
that sync flushes the store buffer and waits for acknowledgments from all
other CPUs (or, more accurately, makes it appear as if it had done so --
speculative execution can be and is used to hide some of the latency).

In contrast, isync (with or without the load and branch) does not
flush the store buffer nor does it cause any particular communication
with the other CPUs. The ordering that isync provides is therefore
quite a bit weaker than that of the sync instruction.
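
For concreteness, the store-buffering (SB) pattern is the one that needs
this store-buffer-flushing strength. In tools/memory-model terms, this
is essentially the kernel's SB+fencembonceonces.litmus, with smp_mb()
mapping to sync on PowerPC:

------------------------------------------------------------------------

C SB+fencembonceonces

(*
 * With full barriers on both sides, the two-zeroes outcome should be
 * forbidden (herd7 says "Never"); weaken either barrier and that
 * outcome becomes reachable.
 *)

{}

P0(int *x, int *y)
{
	int r0;

	WRITE_ONCE(*x, 1);
	smp_mb();
	r0 = READ_ONCE(*y);
}

P1(int *x, int *y)
{
	int r0;

	WRITE_ONCE(*y, 1);
	smp_mb();
	r0 = READ_ONCE(*x);
}

exists (0:r0=0 /\ 1:r0=0)

------------------------------------------------------------------------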

> So I'm just saying that
>
> (a) isync-on-lock is supposed to be much cheaper than sync-on-lock

Completely agreed.

> (b) the AIX documentation at least implies that isync-on-lock (when
> used together with the whole locking sequence) is actually a memory
> barrier

Agreed, but the ordering properties that it provides are similar to
those of the weaker lwsync memory-barrier instruction, and nowhere near
as strong as those of the sync memory-barrier instruction.

> Now, admittedly the powerpc barrier instructions are unfathomable
> crazy stuff, so who knows. But:
>
> (a) lwsync is a memory barrier for all the "easy" cases (ie
> load->store, load->load, and store->load).

Yes.

> (b) lwsync is *not* a memory barrier for the store->load case.

Agreed.

> (c) isync *is* (when in that *sequence*) a memory barrier for a
> store->load case (and has to be: loads inside a spinlocked region MUST
> NOT be done earlier than stores outside of it!).

Yes, isync will wait for the prior stores to -execute-, but it doesn't
necessarily wait for the corresponding entries to leave the store buffer.
That weaker guarantee does suffice to provide ordering from the viewpoint
of some other CPU that subsequently acquires the lock.
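
To illustrate that lock-holder guarantee in tools/memory-model terms,
here is a quick sketch (hypothetical test name, patterned after the
kernel's MP+polocks.litmus): a CPU that acquires the lock after P0's
critical section must also see P0's pre-critical-section store.

------------------------------------------------------------------------

C MP+lock-holder-sees-order

(*
 * Sketch: P0 writes x before releasing the lock, so a P1 that
 * acquires the lock afterward and sees P0's critical-section store
 * to y must also see the store to x.  The exists clause should
 * never be satisfied.
 *)

{}

P0(int *x, int *y, spinlock_t *s)
{
	WRITE_ONCE(*x, 1);
	spin_lock(s);
	WRITE_ONCE(*y, 1);
	spin_unlock(s);
}

P1(int *x, int *y, spinlock_t *s)
{
	int r0;
	int r1;

	spin_lock(s);
	r0 = READ_ONCE(*y);
	spin_unlock(s);
	r1 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 1:r1=0)

------------------------------------------------------------------------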

> So a unlock/lock sequence where the unlock is using lwsync, and the
> lock is using isync, should in fact be a full memory barrier (which is
> the semantics we're looking for).
>
> So doing performance testing on sync/lwsync (for lock/unlock
> respectively) seems the wrong thing to do. Please test the
> isync/lwsync case instead.
>
> Hmm? What am I missing?

That the load-branch-isync has ordering properties similar to the lwsync
memory-barrier instruction, not to the sync memory-barrier instruction.
This means that the isync/lwsync case simply won't provide full ordering
from the viewpoint of CPUs not holding the lock. (As with lwsync,
CPUs holding the lock do see full ordering, as is absolutely required.)

You absolutely must use sync/lwsync or lwsync/sync to get the strong
ordering that is visible to other CPUs not holding the lock.

The PowerPC hardware verification tools agree, as shown below.

Thanx, Paul

------------------------------------------------------------------------

The PPCMEM tools (which the Power hardware guys helped to develop) agree.
For the following litmus test, in which P0 does the lwsync-unlock and
isync-lock and P1 checks the ordering without acquiring the lock,
PPCMEM says "no ordering".

------------------------------------------------------------------------

PPC lock-1thread-WR-barrier.litmus
""
{
l=1;
0:r1=1; 0:r3=42; 0:r4=x; 0:r5=y; 0:r10=0; 0:r11=0; 0:r12=l;
1:r1=1; 1:r4=x; 1:r5=y;
}
P0 | P1 ;
stw r1,0(r4) | stw r1,0(r5) ;
lwsync | sync ;
stw r10,0(r12) | lwz r7,0(r4) ;
lwarx r11,r10,r12 | ;
cmpwi r11,0 | ;
bne Fail1 | ;
stwcx. r1,r10,r12 | ;
bne Fail1 | ;
isync | ;
lwz r3,0(r5) | ;
Fail1: | ;


exists
(0:r3=0 /\ 1:r7=0)

------------------------------------------------------------------------

Here is the output of running the tool locally:

------------------------------------------------------------------------

0:r3=0; 1:r7=0;
0:r3=0; 1:r7=1;
0:r3=1; 1:r7=0;
0:r3=1; 1:r7=1;
0:r3=42; 1:r7=0;
0:r3=42; 1:r7=1;
Ok
Condition exists (0:r3=0 /\ 1:r7=0)
Hash=16c3d58e658f6c16bc3df7d6233d8bf8
Observation SB+lock+sync-Linus Sometimes 1 5

------------------------------------------------------------------------

And you can see the "0:r3=0; 1:r7=0;" state, which indicates that neither
process's loads saw the other process's stores, which in turn indicates
no ordering.
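
For comparison, here is a rough tools/memory-model rendition of the
same pattern (a sketch with a made-up name, the unlock-lock pair
standing in for the lwsync/isync sequence and an initial spin_lock
standing in for the PPC test's initial lock-held state). If I am
reading the proposed model correctly, herd7 likewise says "Sometimes",
because RCtso lock ordering does not order P0's store against its
later load:

------------------------------------------------------------------------

C SB+unlock-lock+mb

(*
 * Sketch: P0's unlock-lock pair plays the role of lwsync/isync, and
 * P1 observes without acquiring the lock.  The write to x and the
 * read from y are not ordered across the unlock-lock pair, so both
 * loads can return zero.
 *)

{}

P0(int *x, int *y, spinlock_t *s)
{
	int r0;

	spin_lock(s);
	WRITE_ONCE(*x, 1);
	spin_unlock(s);
	spin_lock(s);
	r0 = READ_ONCE(*y);
	spin_unlock(s);
}

P1(int *x, int *y)
{
	int r0;

	WRITE_ONCE(*y, 1);
	smp_mb();
	r0 = READ_ONCE(*x);
}

exists (0:r0=0 /\ 1:r0=0)

------------------------------------------------------------------------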

That said, if P1 actually acquired the lock, there would be ordering.
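
In the same sketch, moving P1's accesses inside the lock should flip
the verdict (again a hypothetical test, for illustration):

------------------------------------------------------------------------

C SB+unlock-lock+lockedobserver

(*
 * Sketch: as above, except that P1 now runs under the lock.  The
 * lock serializes the critical sections, so the two-zeroes outcome
 * should no longer be reachable.
 *)

{}

P0(int *x, int *y, spinlock_t *s)
{
	int r0;

	spin_lock(s);
	WRITE_ONCE(*x, 1);
	spin_unlock(s);
	spin_lock(s);
	r0 = READ_ONCE(*y);
	spin_unlock(s);
}

P1(int *x, int *y, spinlock_t *s)
{
	int r0;

	spin_lock(s);
	WRITE_ONCE(*y, 1);
	r0 = READ_ONCE(*x);
	spin_unlock(s);
}

exists (0:r0=0 /\ 1:r0=0)

------------------------------------------------------------------------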

You can also reproduce the PPCMEM result manually using this web site:

https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html