RE: spin_unlock optimization(i386)

Marco Colombo (marco@esi.it)
Sat, 27 Nov 1999 03:59:48 +0100 (MET)


On Fri, 26 Nov 1999, David Schwartz wrote:

> > If not, does it mean that the serializing lock read made by CPU B
> > while spinlocking force a global (=on all CPUs) commit of any pending
> > store (thus making all the system work as WT for all the time B
> > spinlocks)?
>
> Memory barriers enforce order. They can't force an actual commit.

Does the 'lock' before the 'btsl' mean it's a memory barrier?

As i understand it, if your read something right after a mb(),
you see every previous store (system-wide), as if in Strict Order.
I.E. for (;;) { mb(); read a; } will force all the pending stores on other
CPUs being committed before every read (?) BTW, I thougth that a lock
does something even worse: prevents all other CPUs from executing any
read/store until the "locked" instruction on this CPU is completed, thus
"serializing" all read/stores system-wide.
Such a thing may (i think is) be needed in the lock() since you have to
serialize the read (no two CPUs can see a value of 0 at the same time), even
if the variable is cached (on both). To draw it:

CPUA CPUB

asserts LOCK: waits for other
CPUs to complete pending
(read/store) instructions, <before the spinlock, don't care>
prevents them from executing
others, commits all changes
(makes them visibile) on all CPUs

reads and stores the updated <"blocked" by CPU A>
spin_lock value

releases LOCK, other CPUs
may now go on reading/
writing
*note that the store may not
be committed at this time*
<resumes>

tries to do the same lock():
and as a side effect, the store
made by CPUA gets committed (*)

reads the spin_lock values and
fails (lock is held by CPUA)

<loops>

releases the lock (store)

goes thru the LOCK thing again,
the store on CPUA gets committed,
now the test succeeds and the
lock is acquired

(*) that's where the serializing properties of 'lock' are needed. The
following read MUST see the updated value of spin_lock. No delay is allowed
at all. 'lock' on CPU A has to hold back the read on CPU B until the
store is executed, and 'lock' on CPU B has to make sure the store made
by A is visible by B (ie committed) before the following read.

> > So, either 1) we're wasting cycles on CPU B spinning on a *free* lock (for
> > a while), or 2) the whole system performs as WT (or even worst) while B
> > is spinning. Am i missing something?
>
> You are correct. The same applies (generally) to user-level mutexes.

Ehr, ok, but which one of the two? I assume you mean 1)...
But if i'm right about the 'lock' behaviour, 2) holds, and 1) doesn't
happen, since the store made by A is forced by the 'lock' on B, which
takes only one loop. Sorry for not being clear.

> Usually, a mutex lock is implemented as:
>
> lock();
> mb();
>
> And an unlock as:
>
> mb();
> unlock();
>
> While this ensures correct behavior, it does not assure that you don't wait
> 'too long'. And a memory barrier can't do that anyway.

I see. Yet, i've read that a just a store acts as a mb, and if true, and
if a mb is all we need, that i must be wrong with the whole purpose of 'lock'.
And i'm lost.

> The cache coherency logic, on the other hand, should assure that the loop
> eventually reads the right data, since it forces the cache of that processor
> to assume ownership (or at least shared ownership) of that chunk of memory.

Does this means that other pending stores on that chunk get written?
So the "forced commit" i'm referring to happens only to that chunk, and
that's no as bad as i thought. That makes a lot of sense. Still i think
the 'lock' serializes operations also on different, possibly unrelated,
variables, so all pending stores have to be committed on all CPUs,
all chunks.

>
> > 1) could be solved forcing a commit on CPU A right after (as part of) s
> > the unlock() (but, is it possible? were we doing precisely that, before
> > the whole thread starts?);
>
> What does 'forcing a commit' mean? You're not suggesting the cache be
> flushed, are you?

Well, i have no clear idea here... remember i haven't read the real
docs, so i don't know what a 'commit' is: i'll expain what I mean:
basically, commit means make then other CPUs see the change *now*.
If this means flushing the cache, or just make the change visible through
some 'snooping' mechanism, without performing the actual writing,
i don't know.

My concern here is really the penalty of having one CPU spinning on
an instruction with the 'lock' serializing property, which, given that
it works the way i've described (and it may not), kills performance
system-wide. Having it spin on a nop for a little while, may let the other
CPUs go on with their work undisturbed, at the price of some wasted cycles.
If the critical section (on average) is, say, 200 cycles, we may let
CPU B spin in a inner loop for 100 cycles, and it will make only one or
two 'lock's instead of 9-10, and the other 15 CPUs will have 200 cycles
almost clean of those 'lock's. On 2 CPUs I agree this doesn't pay.
On 16 maybe.

Thank you for your time (spent reading my probably too long messages).
.TM.

-- 
      ____/  ____/   /
     /      /       /			Marco Colombo
    ___/  ___  /   /		      Technical Manager
   /          /   /			 ESI s.r.l.
 _____/ _____/  _/		       Colombo@ESI.it

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/