Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

From: Paul E. McKenney
Date: Thu Oct 15 2015 - 11:43:22 EST


On Thu, Oct 15, 2015 at 11:35:44AM +0100, Will Deacon wrote:
> Dammit guys, it's never simple is it?
>
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it. I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded. Here is the sequence of events:
> >
> > o Commit P0's write. The model offers to propagate this write
> > to the coherence point and to P1, but don't do so yet.
> >
> > o Commit P1's write. Similar offers, but don't take them up yet.
> >
> > o Commit P0's lwsync.
> >
> > o Execute P0's lwarx, which reads a=0. Then commit it.
> >
> > o Commit P0's stwcx. as successful. This stores a=1.
>
> On arm64, this is a conditional-store-*release* and therefore cannot be
> observed before the initial write to x...
>
> > o Commit P0's branch (not taken).
> >
> > o Commit P0's final register-to-register move.
> >
> > o Commit P1's sync instruction.
> >
> > o There is now nothing that can happen in either processor.
> > P0 is done, and P1 is waiting for its sync. Therefore,
> > propagate P1's a=2 write to the coherence point and to
> > the other thread.
>
> ... therefore this is illegal, because you haven't yet propagated that
> prior write...

OK. Power distinguishes between propagating to the coherence point
and to each of the other CPUs.

> > o There is still nothing that can happen in either processor.
> > So pick the barrier propagate, then the acknowledge sync.
> >
> > o P1 can now execute its read from x. Because P0's write to
> > x is still waiting to propagate to P1, this still reads
> > x=0. Execute and commit, and we now have both r3 registers
> > equal to zero and the final value a=2.
>
> ... and P1 would have to read x == 1.

Good! Do ARMMEM and herd agree with you?

> So arm64 is ok. Doesn't lwsync order store->store observability for PPC?

Yes. But this is not store->store observability, but rather store->load
visibility. Furthermore, as I understand it, lwsync controls the
visibility to other CPUs, but not necessarily the coherence order.

Let's look at the example C code again:

	CPU 0				CPU 1
	-----				-----

	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
	r3 = xchg(&a, 1);		smp_mb();
					r3 = READ_ONCE(x);

The problem is that we are applying intuitions obtained from a
release-acquire chain, which hands off from stores to loads. In contrast,
this example is quite weird in that we have a store handing off to another
store, but with reads also involved. Making that work on Power requires
full memory barriers on both sides. Intuitively, the coherence order
can be established after the fact as long as all readers see a consistent
set of values based on the subset of the sequence that each reader sees.

Anyway, it looks like Power does need a sync before and after for
value-returning atomics. That certainly simplifies the analysis.
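
For reference, here is a rough sketch of what "fully ordered" has to
mean for the example above. It is not a model of Power and it is not
what herd or ppcmem do: it assumes a simple sequentially consistent
machine in which each access takes effect atomically and in program
order, so smp_mb() becomes a no-op and xchg() is a single step. The
names (struct state, cpu0_step(), cpu1_step(), litmus_forbidden()) are
made up for illustration. The code enumerates every interleaving of the
two CPUs and counts the ones ending with both r3 registers equal to
zero and a == 2.

```c
#include <assert.h>

#define NSTEPS 2	/* each CPU performs two memory steps in this sketch */

/* Shared memory and per-CPU registers for one execution. */
struct state {
	int x, a;	/* shared variables, both initially 0 */
	int r3_cpu0;	/* value returned by CPU 0's xchg() */
	int r3_cpu1;	/* value CPU 1 reads from x */
};

/* Execute step number pc of CPU 0. */
static void cpu0_step(struct state *s, int pc)
{
	assert(pc >= 0 && pc < NSTEPS);
	if (pc == 0) {
		s->x = 1;		/* WRITE_ONCE(x, 1) */
	} else {
		s->r3_cpu0 = s->a;	/* r3 = xchg(&a, 1), as one atomic step */
		s->a = 1;
	}
}

/* Execute step number pc of CPU 1. */
static void cpu1_step(struct state *s, int pc)
{
	assert(pc >= 0 && pc < NSTEPS);
	if (pc == 0)
		s->a = 2;		/* WRITE_ONCE(a, 2) */
	else
		s->r3_cpu1 = s->x;	/* smp_mb() is a no-op here; r3 = READ_ONCE(x) */
}

/*
 * Explore every interleaving from the given point onward.  Returns the
 * number of complete executions that end in the outcome under
 * discussion; *total counts all complete executions.
 */
static int count_forbidden(struct state s, int pc0, int pc1, int *total)
{
	int n = 0;

	if (pc0 == NSTEPS && pc1 == NSTEPS) {
		(*total)++;
		return s.r3_cpu0 == 0 && s.r3_cpu1 == 0 && s.a == 2;
	}
	if (pc0 < NSTEPS) {
		struct state t = s;	/* copy, so siblings start from the same state */

		cpu0_step(&t, pc0);
		n += count_forbidden(t, pc0 + 1, pc1, total);
	}
	if (pc1 < NSTEPS) {
		struct state t = s;

		cpu1_step(&t, pc1);
		n += count_forbidden(t, pc0, pc1 + 1, total);
	}
	return n;
}

/* Run the litmus test from the initial state x == 0, a == 0. */
int litmus_forbidden(int *total)
{
	struct state init = { 0, 0, -1, -1 };

	*total = 0;
	return count_forbidden(init, 0, 0, total);
}
```

Calling litmus_forbidden(&total) visits all six interleavings and finds
none with that outcome: for xchg() to return zero it has to run before
CPU 1's write to a, which puts CPU 0's write to x ahead of CPU 1's read.
That is the guarantee a fully ordered xchg() has to provide, and the
ppcmem sequence above shows that lwsync alone does not provide it.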

Thanx, Paul
