Re: spin_unlock optimization(i386)

Marco Colombo (marco@esi.it)
Sun, 28 Nov 1999 03:27:21 +0100 (MET)


On Sat, 27 Nov 1999, David Schwartz wrote:

>
> > Thank you for your answer. It (combined with the morning light) cleared
> > things up. I was simply wrong, mixing different issues (instruction
> > order, cache behavior and serialization) together instead of keeping
> > them separated. [rereading - I'll cut some of the quoting to keep this
> > message small... yet it's big, sorry!]
>
> It's not an easy thing to wrap your brain around.

Definitely.

[skipping a lot]

> > Well, system-wide, anyway. Stores made by lock operations MUST be seen
> > on all CPUs as soon as they complete.
>
> This is not true. They just have to be atomic. If another processor wants
> to see it, then its cache coherency logic will have to arbitrate for it. And
> it's not even guaranteed to have actually been done until another store
> operation takes place. (Since x86's reserve the right to delay reads and
> writes).

Maybe I'm wrong, but no. Let's see:

       CPU1                          CPU2
t0
t1     'lock' read spin_lock
       store spin_lock (updated)

t2                                   'lock' read spin_lock
                                     <loops>

'lock' (executed by CPU1 at t1) must first block the other CPUs from executing
other 'lock' operations (on the same variable, at least) for the whole of t1.
The operation is atomic (serialized) system-wide. But at t2 CPU2 MUST see
the updated value of spin_lock with no delay, or mutual exclusion will
fail. So cache coherency must be forced somehow. Whether this is done by CPU1
flushing its cache or by CPU2 'snooping' the bus, I don't know, but it MUST
happen somehow. And please note there's no "another store" on CPU2,
just a read. Normal cache coherency (which allows delayed store effects)
does not apply here. You need something 'stronger'.
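
To make that concrete, here is a minimal sketch (my own illustration, assuming
GCC inline asm on i386; this is NOT the kernel's actual spin_lock code and the
my_* names are made up) of a test-and-set lock built on xchg, which carries an
implicit 'lock':

typedef struct { volatile unsigned int lock; } my_spinlock_t;	/* hypothetical type */

/* xchg with a memory operand is implicitly locked: the read-modify-write
 * is one atomic transaction, and its store is globally visible before any
 * other locked access to the same location can proceed. */
static inline unsigned int xchg_u32(volatile unsigned int *ptr, unsigned int val)
{
	__asm__ __volatile__("xchgl %0, %1"
			     : "+r" (val), "+m" (*ptr)
			     :
			     : "memory");
	return val;		/* the value *ptr held before the swap */
}

static inline void my_spin_lock(my_spinlock_t *s)
{
	while (xchg_u32(&s->lock, 1) != 0)	/* old value was 1: someone holds it */
		while (s->lock != 0)		/* spin on plain reads until it looks free */
			;
}

static inline void my_spin_unlock(my_spinlock_t *s)
{
	s->lock = 0;	/* plain store; x86 won't reorder it before older stores */
}

The inner loop spins on plain reads: only the xchg needs the locked cycle, and
the updated lock word reaches the spinning CPU through the normal cache
coherency traffic, which is the 'snooping' I mentioned above.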

> > This requirement is mainly the
> > origin of my mistakes. Stores MUST be seen, but only by the reads made
> > inside another lock operation. That is: every read made inside a lock
> > (but not normal reads) MUST see any previous store made inside a lock
> > (but not normal stores). Lock-reads enforce coherency of lock-stores
> > only. Also, lock operation are stricly serialized, ie they are not
> > allowed to overlap on different CPUs.
>
> Lock operations can overlap on different CPUs, providing they are locking
> different things. If they try to lock the same thing, cache coherency will
> save you.
>
> > When a lock operation starts,
> > it prevents (blocks) all CPUs trying to execute another lock. CPUs
> > executing normal operation are unaffected. Since three on more
> > CPUs (even out of 16) trying to execute lock at the same time is very
> > unlikely, that not an issue (even two is uncommon, since the section
> > proteced by a spin_lock should be short).
>
> Nope. Not true at all. The lock just makes the operation atomic, preventing
> another processor from 'stealing' the memory out from under it.

I think we're saying the same thing. If another CPU tries to perform the same
'lock', it will be delayed (stopped for a while).

> > Here, "previous/following" mean in the absolute timeline. What made me
> > confused, is the (many) way you look at a sequence of instructions.
> > There's the order they are (supposed to be) in the source code (C assumed
> > here, of course). Compilers do reorder them, to optimize CPU work.
> > This leads to an order of instruction in memory which may be different.
> > This is (I hope) the same order the CPU fetches them from memory
> > (but I'm not sure).
>
> No. Not at all. The CPU can also reorder memory operations. The x86 doesn't
> reorder write operations, but other CPUs even do that.
>
> > Anyway, they may be reordered at execution time,
> > and that's what mb()'s are for, to help programmers control this.
>
> Right.
>
> > Then there's the order they are excuted, from an absolute timeline,
> > as seen by an external observer, who can see all CPU/cache/bus/memory
> > activity. Last, that's the order of instructions (better, instruction
> > effects) as seen by other system entities, as CPUs. That's what we
> > care about on SMP, and that's what Processor Order is all about (i think).
> >
> > The discussion on Processor Order pointed out that we don't really care
> > of real (absolute) order, but only of the order effects are seen.
>
> Right. And that's why we use memory barriers.
>
> > There are primitives that enforce serialization of effects, by means
> > of tightening them to the absolute timeline (that's Strict Order, BTW),
> > and that's what lock does. Effects of read/stores in a lock are strictly
> > serialized. That's required to implement mutex (spinlocks).
>
> Nope. Again, lock just makes 'read/modify/write' operations atomic. No
> more, no less.

I think we don't agree on "atomic" here. To me, "atomic" means that other CPUs
either see none of the effects of the operations protected by 'lock' (you are
"before") or see them all (you are "after"); you can't be "in the middle".
That's not enough. If one CPU acquires the lock,
the other CPUs MUST become aware of that as soon as it happens, or they may
enter the same critical section. Since they perform a 'lock read' to see
if the lock is acquired, the store made by the first CPU and the reads made
by the others must be ordered. You need something stronger than Processor Order.


   CPU1                    CPU2
                           a=0
   <read a;                a=0
    modify;                a=0
    store a;>              a=0
   <critical sec.>         a=1
    a=1                    <read a;        (*)
    a=1                     modify;
    a=1                     store a;>
    a=1                    <critical sec.>

(*) under Processor Order you may still see the old value, 0, here.
The 'store a' made by CPU1 MUST propagate immediately to CPU2 so that its
'read a' can see it. You can't simply put a mb() as the first instruction of
the critical section, as that only makes the window smaller. The mb() must be
part of the atomic code, right after the store. That's what I'm saying:
the result of the store must be immediately visible to all CPUs.
(Limited to other lock-reads. A real mb() will act on normal reads, too,
I think.)
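
To put the same thing in code: a minimal sketch (mine, purely illustrative;
the mb() below is the classic i386 locked-add trick, which I believe is close
to what the kernel uses, but treat that as an assumption) of why the barrier
must be part of the atomic read-modify-write instead of an instruction tacked
on after it:

#define mb() __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory")

volatile int lockvar = 0;	/* hypothetical lock word, 0 = free */

/* BROKEN acquire: the read and the store are two separate operations, so
 * two CPUs can both read 0 before either store becomes visible to the
 * other.  The mb() afterwards only makes that window smaller. */
void broken_lock(void)
{
	while (lockvar != 0)	/* read: the lock looks free...          */
		;
	lockvar = 1;		/* ...store: but the other CPU saw 0 too  */
	mb();			/* too late, mutual exclusion already lost */
}

Only a single locked read-modify-write (an xchg, or a 'lock'-prefixed
decrement) closes the window, because no other CPU can slip its own access
between the read and the store.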

>
> > So, when analyzing a piece of code (e.g. that spin_locks), first you
> > should take in account compiler optimizations: use 'volatile'
> > where needed,
> > inline some asm (also used to generate specific instructions, such as
> > 'lock'). Then, having the picture of what is generated, apply any
> > "reordering rules" the CPU uses runtime. Put some mb()'s to keep things
> > ordered the way you want. You get what is really executed on a single CPU.
> > Now, while evaluating concurrency, apply the Processor Order model, to see
> > how the effects of instructions executed by another CPU are seen on this
> > CPU. If you're accessing some shared data, you need a mutex, which is
> > implemented by means of an atomic test-and-set under the 'lock' prefix,
> > which makes it sistem-wide atomic (serialized) with respect of other
> > 'lock's on other CPUs.
> > Does it make sense?
>
> Mostly.
>
> > > On x86 architectures, stores are always performed in order.
> > However, reads
> > > may be reordered and the relative order of reads and writes is not
> > > guaranteed. This is what memory barriers are for.
> >
> > Oh! You really mean that (pseudocode, UP):
> >
> > a = 0; (write a)
> > ...
> > a = 1; (write a)
> > b = a; (read a; write b)
> >
> > print b;
> >
> > brings to undefined result, could be b=1 as expected, or b=0 if read a
> > gets executed before write a? Should it be rewriten:
>
> No, no. The reordering of reads or writes is invisible to UP code. The
> processor can see what affects reordering has and will not reorder such that
> the effects change the results of code. The problem is, the processor cannot
> take into account what other processors might be doing.

So it's only an MP issue, then.

>
> > a = 0; (write a)
> > ...
> > a = 1; (write a)
> > mb(); /* read can't walk up this */
> > b = a; (read a; write b)
> >
> > mb();
> > print b;
> >
> > to be sure that 1 is printed? Print b contains a read b which may
> > be executed
> > before the write b, without the mb(), too. I'm pretty sure i'm wrong.
> > 'read a' means "i want to see the effect of the last write on a", so
> > can't be moved before the last write or its very semantic will break.
>
> Yes, it can be moved before the last write without breaking the semantic.
> The semantic doesn't care when the read or write actually takes place, just
> what value you get!
>
> > 'read b' CAN be moved before 'write a', because it's actually "i want to
> > see the effect of the last write on b, who cares of a?". But i can't
> > imagine any case in which you may want to prevent this from happening
> > with a mb(). Unless the variables are shared, e.g. you're "talking" to
> > another chip via I/O registers or shared mem. That's completely different,
> > as you may want to use UC anyway, and it may also happen that a 'read'
> > is not just that, but implies some status changes in you peer, so acts
> > also as a write (logically), with the CPU seeing it as a "just a read"
> > and reordering it and of course you don't want that.
> >
> > So, forgetting the I/O registers case, when do mb()'s come into play?
>
> When you really do care not just what values you get but in what order the
> reads and writes actually take place. Consider:

I was still in the UP case. Sorry for not being clear.

Your following example is two CPUs accessing shared variables without
locking.

>
> Definitions:
> int valid=0;
> int value=0;
>
> CPU1:
> value=45;
> valid=1;
>
> CPU2:
> while(valid==0);
> printf("%d\n", value);
>
> Now, even though the x86 doesn't reorder writes, it does reorder reads. So
> the read for 'valid==0' could take place _after_ the read for value, even
> though the read for value is in the next line. The processor can't see any
> side effects to this reordering. But you see them!

Completely agreed.

[i'll skip another bit of your message, as we agree on that]

> > (assuming you wrote "write a; read b", of course). If the answer is
> > "never", on UP you don't need mb()'s unless special cases, which are
> > really some kind of MP, as two actors come into play.
>
> On UP, you shouldn't need memory barriers. Unless you're dealing with other
> devices on the bus.

Exactly what I mean by "some kind of MP". Again we agree completely here.
(and thanks: my knowledge comes from yours) B-)

Note that from here on I'm talking about SMP...

> > I can think of SMP, where if a and b are shared but you want to see them
> > via your WT cache, for performance reasons (not only). If you don't use
> > locks, you may want to keep your access pattern ordered. Since there's
> > a kind of communcation protocol you must adhere to (logically),
> > you can't just ask for the answer before you put the question, in a way.
>
> Right.
>
> > But when does that happen? Two (or more) CPUs accessing *r/w* shared
> > data without lock (better, outside a mutex-protected critical section)?
>
> The problem is not so much 'without lock', the problem is implementing the
> 'lock' itself.

If I recall my studies well, you can implement locking WITHOUT hardware
(lock) support, just using normal reads and writes. With N CPUs it's
quite inefficient. In THAT case, you REALLY need mb()'s almost
everywhere in the software lock code, I agree. You gain synchronization
by relying on instruction order on every single CPU.
But we do have hardware support. We don't need to implement lock.
We have a locking primitive.
Oh, maybe you mean we're implementing the lock() function. OK.
Your example made clear to me a possible use of mb(). See below.
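
For the record, here is a minimal sketch of such a software-only lock for two
CPUs (Peterson's algorithm; my own illustration, not anything from the kernel,
and the sw_* names are made up). The interesting mb() is the one between our
stores and the read of the other CPU's flag: store-vs-load order is exactly
what the x86 will not guarantee by itself.

#define mb() __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory")

static volatile int flag[2] = { 0, 0 };	/* flag[i] = 1: CPU i wants the lock */
static volatile int turn = 0;		/* whose turn it is to back off */

void sw_lock(int self)			/* self is 0 or 1 */
{
	int other = 1 - self;

	flag[self] = 1;
	turn = other;
	mb();		/* our stores must be visible before we trust
			 * what we read from the other CPU's flag */
	while (flag[other] && turn == other)
		;			/* busy wait */
}

void sw_unlock(int self)
{
	mb();			/* everything done inside the section first... */
	flag[self] = 0;		/* ...then let the other CPU in */
}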

> > Note that if one CPU accesses data read-only, it does not matter if it
> > reads *old* data, by definition.
>
> Not so! Consider the loop above: "while(valid==0); printf("%d\n", value);".
> These are just two reads, and we are very concerned with reading old data in
> the printf.

True. You're right. mb()'s let you implement a safe read of shared variables,
with no need for a lock. Now I see what they're for.

> > If you're taking into account time,
> > or better serialization of events, you need a lock anyway.
>
> Yes, but for complex reasons. Otherwise you might think the following is
> just fine:
>
> volatile int value,valid;
>
> CPU1:
> value=45;
> valid=1;
>
> CPU2:
> while(valid==0);
> printf("%d\n", value);
>
> It's not obvious why you need a lock in this case. But I think you can now
> see that the problem is that the reads may not actually take place in the
> order you coded them.

Yes. It's so obscure that I don't see it (the need for a lock). An mb() should
be enough, since CPU2's access is read-only.
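
Something like this is what I have in mind (just a sketch of where I think the
barriers go; the writer/reader split and the local mb() are only for
illustration):

#include <stdio.h>

#define mb() __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory")

volatile int valid = 0;
volatile int value = 0;

void writer(void)		/* CPU1 */
{
	value = 45;
	mb();		/* 'value' must be out before 'valid'; x86 keeps stores
			 * in order anyway, so this mostly restrains the compiler */
	valid = 1;
}

void reader(void)		/* CPU2: read-only, no lock */
{
	while (valid == 0)
		;
	mb();		/* the read of 'value' must not be satisfied before
			 * the read of 'valid' that finally saw 1 */
	printf("%d\n", value);
}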

> > If you want
> > to read coherent data (before or after a certain atomic event, call it
> > a transaction) you still need some locking. No, if you don't need
> > locking, you need mb()'s only if there's a concurrent r/w access to
> > the shared data. If you lock, you don't need mb()'s, since only
> > one CPU executes the critical section, so we fall back to the UP case.
>
> No, you often need both locks and memory barriers, since they do vastly
> different things.

Now I see. There are many uses of mb()'s outside lock-protected code.
But inside a critical section, why would you need them? Only one CPU
executes the code, so that's really like the UP case.
Well, let me take one of your examples (modified):

lock();
mb();

read a;
store a;

mb();
unlock();

I think the mb()'s are not needed. You acquire the lock only if you see the
effect of a previous unlock() (a store). Given Processor Order, if you see
that store, you must also see the effects of previous stores, such as the
'store a'. So the 'read a' can't be executed before the lock (that's what
you're protecting us from, I think). Also, since stores are already in order,
you don't need the mb() between 'store a' and unlock().
Here I'm assuming no one else is modifying 'a' without lock(). That is, the
only 'store a' is the one in the critical section. Having some CPU modify
it without acquiring the lock (somewhere else in the code) is bad programming
anyway, I think.

> > You just need all changes you make from your critical section to be
> > committed (visible for others) as a whole, and no only part of them.
>
> Yes, but without atomic operations, it's hard to let other processors know
> that you've finished all your operations.

A critical section *is* atomic, from the point of view of others trying to
execute it. If some other CPU accesses the shared variables (read-only, say)
without acquiring the lock, yes, it will see in-progress operations.
There you may need mb()'s, to make its "view" coherent at least. It sounds
like quite a dirty hack to me, but I've nothing against it if it really
pays off B-) ("pecunia non olet").

> > As we have already seen, using lock, thanks to Processor Order,
> > this holds. No mb() required, with lock.
>
> Not so, because even if processor 1 executes a locked operation, that
> doesn't mean that processor 2 does, so reads could be reordered around the
> read that detects the lock. You still need memory barriers. (If you don't
> see this, look at my examples above and see if there's any way a LOCK will
> help you there. The problem is totally one of ordering. Ordering and
> atomicity are different issues.)

CPU1:
lock();
value=45;
valid=1;
unlock();

CPU2:
ok=0;
do {
	lock();
	ok = valid;
	unlock();
} while (ok==0);
printf("%d\n", value);

I think it's OK. Slower, but correct. mb()'s are a better solution here, of
course. Unless CPU2 also updates 'value', a lock is not necessary. But it does
the job.

>
> > So, either the semantic of read is broken, and you need mb() almost every-
> > where, even to decrement a counter (which of course i don't believe),
> > or mb() are very seldom necessary, just for r/w access to shared variables
> > on (S)MP OUTSIDE a critical section (no lock).
>
> Nope. You missed the third alternative -- mb() are necessary any place
> where something outside the processor could cause its heuristics to think a
> reordering is safe when it isn't.

Yup. But that happens only with SMP and shared variables... What I was
completely missing was their use for correct read-only access to shared vars.

> > Does it mean that every read will always see the right value, never an old
> > one? Isn't it Strict Order instead of Processor Order?
>
> Every read will always see the right value on UP, whether or not the read
> actually takes place when you think it does. Because of this, you have to be
> very careful when there are circumstances outside that single processor's
> control. It becomes too smart for its own good.
>
> DS

.TM.

-- 
      ____/  ____/   /
     /      /       /			Marco Colombo
    ___/  ___  /   /		      Technical Manager
   /          /   /			 ESI s.r.l.
 _____/ _____/  _/		       Colombo@ESI.it
