RE: spin_unlock optimization(i386)

Marco Colombo (marco@esi.it)
Sat, 27 Nov 1999 18:34:54 +0100 (MET)


Thank you for your answer. It (combined with the morning light) cleared
things up. I was simply wrong, mixing different issues (instruction
order, cache behavior and serialization) together instead of keeping
them separated. [rereading - I'll cut some of the quoting to keep this
message small... yet it's big, sorry!]

Well, here's how i see it now:

first of all, i was REALLY wrong in my understanding of how 'lock' works:
it does not have to delay all activities on all CPUs, but just other 'lock's.
In other words, if CPU B updates the lock variable by means of a simple
store (without the lock prefix), this is just buggy code; reads or stores
on different variables are not aware of lock operations, as they don't
need to. Only 'lock' operations get serialized.
Cache coherency makes following 'lock's see the updated value of
the lock variable at once, with no delay.
unlock(), which does a normal store, MAY be seen delayed on other CPUs,
causing then to waste some cycles spinning.
Protecting the unlock() store with a 'lock' may avoid this, but that's
optimizing for the uncommon case (hopefully having a CPUs waiting in
a spinlock is rare).

On Fri, 26 Nov 1999, David Schwartz wrote:
[...]
> No. Memory barriers are not system-wide. They only affect the reordering of
> memory operations from a single processor.
>
> > I.E. for (;;) { mb(); read a; } will force all the pending stores on other
> > CPUs being committed before every read (?)
>
> No. No. A memory barrier simply prevents the active processor from
> 'migrating' reads or writes across the barrier.

I see. So, this is really a UP matter, not SMP. mb() is used to make a
sequence of instructions "atomic" on this CPU, as regards reordering:

<no effects of the "atomic" code are seen here>
mb()
<"atomic" code>
mb()
<ALL effects of the "atomic" code must be seen here>

I said "effects" because reads can execute before, but if something changes
later (store) the results of the pre-execution are discarded, and reads
re-executed at the right time.

>
> > BTW, I thougth that a lock
> > does something even worse: prevents all other CPUs from executing any
> > read/store until the "locked" instruction on this CPU is completed, thus
> > "serializing" all read/stores system-wide.
>
> No, that's not what a lock does either. A lock simply makes a
> 'read-modify-write' operation atomic.

Well, system-wide, anyway. Stores made by lock operations MUST be seen
on all CPUs as soon as they complete. This requirement is mainly the
origin of my mistakes. Stores MUST be seen, but only by the reads made
inside another lock operation. That is: every read made inside a lock
(but not normal reads) MUST see any previous store made inside a lock
(but not normal stores). Lock-reads enforce coherency of lock-stores
only. Also, lock operation are stricly serialized, ie they are not
allowed to overlap on different CPUs. When a lock operation starts,
it prevents (blocks) all CPUs trying to execute another lock. CPUs
executing normal operation are unaffected. Since three on more
CPUs (even out of 16) trying to execute lock at the same time is very
unlikely, that not an issue (even two is uncommon, since the section
proteced by a spin_lock should be short).

Here, "previous/following" mean in the absolute timeline. What made me
confused, is the (many) way you look at a sequence of instructions.
There's the order they are (supposed to be) in the source code (C assumed
here, of course). Compilers do reorder them, to optimize CPU work.
This leads to an order of instruction in memory which may be different.
This is (I hope) the same order the CPU fetches them from memory
(but I'm not sure). Anyway, they may be reordered at execution time,
and that's what mb()'s are for, to help programmers control this.
Then there's the order they are excuted, from an absolute timeline,
as seen by an external observer, who can see all CPU/cache/bus/memory
activity. Last, that's the order of instructions (better, instruction
effects) as seen by other system entities, as CPUs. That's what we
care about on SMP, and that's what Processor Order is all about (i think).

The discussion on Processor Order pointed out that we don't really care
of real (absolute) order, but only of the order effects are seen.
There are primitives that enforce serialization of effects, by means
of tightening them to the absolute timeline (that's Strict Order, BTW),
and that's what lock does. Effects of read/stores in a lock are strictly
serialized. That's required to implement mutex (spinlocks).

So, when analyzing a piece of code (e.g. that spin_locks), first you
should take in account compiler optimizations: use 'volatile' where needed,
inline some asm (also used to generate specific instructions, such as
'lock'). Then, having the picture of what is generated, apply any
"reordering rules" the CPU uses runtime. Put some mb()'s to keep things
ordered the way you want. You get what is really executed on a single CPU.
Now, while evaluating concurrency, apply the Processor Order model, to see
how the effects of instructions executed by another CPU are seen on this
CPU. If you're accessing some shared data, you need a mutex, which is
implemented by means of an atomic test-and-set under the 'lock' prefix,
which makes it sistem-wide atomic (serialized) with respect of other
'lock's on other CPUs.
Does it make sense?

[...]
>
> Your analysis is entirely wrong because it is grounded in misunderstandings
> of what memory barriers and locks actually do. They have nothing to do with
> 'committing' reads or writes to main memory.

Agreed. Probably, under a 'lock', stores do get commited at once, or reads
do force the commit of previous locked stores. Mutual exclusion needs that,
i think.

[...]
> On x86 architectures, stores are always performed in order. However, reads
> may be reordered and the relative order of reads and writes is not
> guaranteed. This is what memory barriers are for.

Oh! You really mean that (pseudocode, UP):

a = 0; (write a)
...
a = 1; (write a)
b = a; (read a; write b)

print b;

brings to undefined result, could be b=1 as expected, or b=0 if read a
gets executed before write a? Should it be rewriten:

a = 0; (write a)
...
a = 1; (write a)
mb(); /* read can't walk up this */
b = a; (read a; write b)

mb();
print b;

to be sure that 1 is printed? Print b contains a read b which may be executed
before the write b, without the mb(), too. I'm pretty sure i'm wrong.
'read a' means "i want to see the effect of the last write on a", so
can't be moved before the last write or its very semantic will break.

'read b' CAN be moved before 'write a', because it's actually "i want to
see the effect of the last write on b, who cares of a?". But i can't
imagine any case in which you may want to prevent this from happening
with a mb(). Unless the variables are shared, e.g. you're "talking" to
another chip via I/O registers or shared mem. That's completely different,
as you may want to use UC anyway, and it may also happen that a 'read'
is not just that, but implies some status changes in you peer, so acts
also as a write (logically), with the CPU seeing it as a "just a read"
and reordering it and of course you don't want that.

So, forgetting the I/O registers case, when do mb()'s come into play?
When do you really want to keep 'read b' from executing before 'write a'?
(assuming you wrote "write a; read b", of course). If the answer is
"never", on UP you don't need mb()'s unless special cases, which are
really some kind of MP, as two actors come into play.

I can think of SMP, where if a and b are shared but you want to see them
via your WT cache, for performance reasons (not only). If you don't use
locks, you may want to keep your access pattern ordered. Since there's
a kind of communcation protocol you must adhere to (logically),
you can't just ask for the answer before you put the question, in a way.

But when does that happen? Two (or more) CPUs accessing *r/w* shared
data without lock (better, outside a mutex-protected critical section)?
Note that if one CPU accesses data read-only, it does not matter if it
reads *old* data, by definition. If you're taking into account time,
or better serialization of events, you need a lock anyway. If you want
to read coherent data (before or after a certain atomic event, call it
a transaction) you still need some locking. No, if you don't need
locking, you need mb()'s only if there's a concurrent r/w access to
the shared data. If you lock, you don't need mb()'s, since only
one CPU executes the critical section, so we fall back to the UP case.
You just need all changes you make from your critical section to be
committed (visible for others) as a whole, and no only part of them.
As we have already seen, using lock, thanks to Processor Order,
this holds. No mb() required, with lock.

So, either the semantic of read is broken, and you need mb() almost every-
where, even to decrement a counter (which of course i don't believe),
or mb() are very seldom necessary, just for r/w access to shared variables
on (S)MP OUTSIDE a critical section (no lock).

[...]
> There are no forced commits directly. However, the cache coherency logic
> will do the equivalent if one processor tries to gain access to an area of
> memory that another processor has in its cache.
>
[...]
> This is already done by the cache coherency logic. All operations are
> always committed exactly when needed to avoid conflict with an operation
> from another processor on the same memory.
>
[...]
> The thing is, the other CPU will eventually cause the cache coherency logic
> to arbitrate for that area of memory. That will force what you call a commit
> (but actually it's a writeback).

Does it mean that every read will always see the right value, never an old
one? Isn't it Strict Order instead of Processor Order?

>
> DS

.TM.

-- 
      ____/  ____/   /
     /      /       /			Marco Colombo
    ___/  ___  /   /		      Technical Manager
   /          /   /			 ESI s.r.l.
 _____/ _____/  _/		       Colombo@ESI.it

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/