Re: [PATCH 0/2] jump label: 2.6.38 updates

From: Will Simoneau
Date: Tue Feb 15 2011 - 16:20:44 EST


On 11:01 Tue 17 Feb , Will Newton wrote:
> On Mon, Feb 14, 2011 at 11:19 PM, H. Peter Anvin <hpa@xxxxxxxxx> wrote:
> > On 02/14/2011 02:37 PM, Matt Fleming wrote:
> >>>
> >>> I don't see how cache coherency can possibly work if the hardware
> >>> behaves this way.
> >>
> >> Cache coherency is still maintained provided writes/reads both go
> >> through the cache ;-)
> >>
> >> The problem is that for read-modify-write operations the arbitration
> >> logic that decides who "wins" and is allowed to actually perform the
> >> write, assuming two or more CPUs are competing for a single memory
> >> address, is not implemented in the cache controller, I think. I'm not a
> >> hardware engineer and I never understood how the arbitration logic
> >> worked but I'm guessing that's the reason that the ll/sc instructions
> >> bypass the cache.
> >>
> >> Which is why the atomic_t functions worked out really well for that
> >> arch, such that any accesses to an atomic_t * had to go through the
> >> wrapper functions.
> >
> > I'm sorry... this doesn't compute. ?Either reads can work normally (go
> > through the cache) in which case atomic_read() can simply be a read or
> > they don't, so I don't understand this at all.
>
> The CPU in question has two sets of instructions:
>
> load/store - these go via the cache (write through)
> ll/sc - these operate literally as if there is no cache (they do not
> hit on read or write)
>
> This may or may not be a sensible way to architect a CPU, but I think
> it is possible to make it work. Making it work efficiently is more of
> a challenge.

Speaking as a (non-commercial) processor designer here, but feel free to point
out anything I'm wrong on. I have direct experience implementing these
operations in hardware so I'd hope what I say here is right. This information
is definitely relevant to the MIPS R4000 as well as my own hardware. A quick
look at the PPC documentation seems to indicate it's the same there too, and it
should agree with the Wikipedia article on the subject:
http://en.wikipedia.org/wiki/Load-link/store-conditional

The entire point of implementing load-linked (ll) / store-conditional (sc)
instructions is to have lockless atomic primitives *using the cache*. Proper
implementations do not bypass the cache; in fact, the cache coherence protocol
must get involved for them to be correct. If properly implemented, these
operations cause no external bus traffic if the critical section is uncontended
and hits the cache (good for scalability). These are the semantics:

ll: Essentially the same as a normal word load. Implementations will need to do
a little internal book-keeping (i.e. save physical address of last ll
instruction and/or modify coherence state for the cacheline).

sc: Store a word if and only if the address has not been written by any other
processor since the last ll. If the store fails, write 0 to a register,
otherwise write 1.

The address may be tracked on cacheline granularity; this operation may
spuriously fail, depending on implementation details (called "weak" ll/sc).
Arguably the "obvious" way to implement this is to have sc fail if the local
CPU snoops a read-for-ownership for the address in question coming from a
remote CPU. This works because the remote CPU will need to gain the cacheline
for exclusive access before its competing sc can execute. Code is supposed to
put ll/sc in a loop and simply retry the operation until the sc succeeds.

Note how the cache and cache coherence protocol are fundamental parts of this
operation; if these instructions simply bypassed the cache, they *could not*
work correctly - how do you detect when the underlying memory has been
modified? You can't simply detect whether the value has changed - it may have
been changed to another value and then back ("ABA" problem). You have to snoop
bus transactions, and that is what the cache and its coherence algorithm
already do. ll/sc can be implemented entirely using the side-effects of the
cache coherence algorithm; my own working hardware implementation does this.

So, atomically reading the variable can be accomplished with a normal load
instruction. I can't speak for unaligned loads on implementations that do them
in hardware, but at least an aligned load of word size should be atomic on any
sane architecture. Only an atomic read-modify-write of the variable needs to
use ll/sc at all, and only for the reason of preventing another concurrent
modification between the load and store. A plain aligned word store should be
atomic too, but it's not too useful because a another concurrent store would
not be ordered relative to the local store.

Attachment: pgp00000.pgp
Description: PGP signature