Re: spin_unlock optimization(i386)

Erich Boleyn (esboleyn@ichips.intel.com)
Thu, 25 Nov 1999 20:35:02 -0800


Ingo Molnar <mingo@chiara.csoma.elte.hu> wrote:

...
> > If Intel only needed to be atomic, doing the "lock btsl" would not need to
> > take 22 cycles. The cost comes from being a syncronization barrier, which
> > is needed to avoid the re-ordering that intel _will_ do.
>
> if i was Intel and had the instruction traces they have, maybe i'd have
> been a wimp as well :) Making LOCK-ed instruction go straight through via
> just relying on a (hypotetical) internal 'conflict prevention' mechanizm
> would definitely have been a feat, and would have triggered all the
> remaining bugs ;)

I'm pretty sure that we err on the side of paranoia here sometimes. There
is a LOT of software which depends on us being right, and it's better to
hedge than to get an extra cycle or two and lose something due to other
troubles...

But, it is not needed to avoid any kind of re-ordering that IA32 will
do. I still maintain that.

IA32 just doesn't do program visible reordering. Any reordering is
done only as a performance enhancement and with temporary results that
are checked prior to becoming committed state.

...
> ok. I think we can actually prove wether reads can even execute noncausal.
> Lets consider the following code:
>
> CPU0 CPU1
>
> volatile int i, __fill[8], j; int a, b;
>
> i = j = 0;
>
> for (;;) { for (;;) {
> i++; a = i;
> j++; b = j;
> } if (a < b)
> BUG()
> }
>
> (i and j are shared common variables on different cachelines, a and b are
> local ones)
>
> if your theory is true, then 'BUG()' should trigger sooner or later,
> correct?

No.

To guarantee that you've observed the store to "i", you HAVE to read
the store to "j" first! Plus, your second "for" loop could have blocked
just after reading "i" into "a", while the first loop could be continuing
to count upward. "b" could end up with an arbitrarily larger value.

To make this work correctly in the face of both Processor Ordering
and pauses in the process execution, the second loop should actually say:

for (;;) {
b = j;
a = i;
if (a < b)
BUG();
}

With Weak Ordering, this could fail, since you may have not gotten
all the updates to "i" yet.

What you know from Processor Ordering is:

If you see a particular store from processor A, you know you've
seen all PRIOR stores in program order from processor A.

This means you have to gate relying on data from another processor until
you've checked that you have seen kind of "completion event".

This is exactly what you get with a spinlock situation:

P0 P1

X = 0;
spin_unlock(Y); // does a store to Y

spin_lock(Y);
READ X;

So, P0 does the stores to X, then Y, P1 waits for the store to Y,
which guarantees P1 will see the right data in X.

Erich Boleyn
PMD IA32 Architecture
Intel

--
Erich Boleyn
PMD Architecture
<esboleyn@ichips.intel.com>

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/