Re: [PATCH 1/19] MUTEX: Introduce simple mutex implementation

From: Linus Torvalds
Date: Fri Dec 16 2005 - 11:28:28 EST

On Fri, 16 Dec 2005, David Howells wrote:
>
> No, they're not. LL/SC is more flexible than CMPXCHG because under some
> circumstances, you can get away without doing the SC, and because sometimes
> you can do one LL/SC in lieu of two CMPXCHG's because LL/SC allows you to
> retrieve the value, consider it and then modify it if you want to. With
> CMPXCHG you have to anticipate, and so you're more likely to get it wrong.

You can think of LL/SC as directly translating into LD/CMPXCHG, so in that
sense CMPXCHG is no less flexible. LL/SC still has other advantages,
though. See later.
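
To make the LD/CMPXCHG translation concrete, here is a minimal sketch in C11 atomics of the "retrieve the value, consider it, then modify it if you want" pattern. The helper name inc_not_zero and the choice of memory orderings are assumptions made for illustration, not anything from the patch under discussion. The plain load plays the role of LL, the compare-exchange plays the role of SC, and the early exit is the case where you get away without doing the store at all.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: increment *v unless it is currently zero. */
static bool inc_not_zero(atomic_uint *v)
{
	/* The "LL" half: an ordinary load of the current value. */
	unsigned int old = atomic_load_explicit(v, memory_order_relaxed);

	/* Consider the value before deciding whether to write at all. */
	while (old != 0) {
		/*
		 * The "SC" half: store old + 1 only if *v still equals old.
		 * On failure, old is refreshed with the current value and
		 * the loop condition re-examines it.
		 */
		if (atomic_compare_exchange_weak_explicit(v, &old, old + 1,
				memory_order_acq_rel, memory_order_relaxed))
			return true;
	}

	/* Saw zero: give up without ever attempting the store. */
	return false;
}
```

Nothing has to be anticipated up front here: the value is looked at first, and the cmpxchg is only issued once the decision to modify has been made.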

> I've had a play with x86, and on there CMPXCHG, XCHG and XADD give worse
> performance than INC/DEC for some reason. I assume this is something to do
> with how the PPro CPU optimises itself. On PPro CPUs at least, counting
> semaphores really are the most efficient way. CMPXCHG, whilst it ought to be
> better, really isn't.

The notion that CMPXCHG "ought to be better" is a load of bull.

There are two advantages of "lock inc/dec" over "ld/cmpxchg": one is the
obvious one that the CPU core just has a much easier time with the
unconditional instruction, and never has to worry about things like
conditional branches or waste cycles on multiple instructions. Just compare
the sequences:

	lock inc mem

vs

	back:
		load mem,reg1
		reg2 = reg1+1
		cmpxchg mem,reg1,reg2
		jne forward	# get branch prediction right
		return
	forward:
		jmp back

Guess which one is faster?
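
For reference, the same two sequences can be sketched in C11 atomics. The function names are made up for the example, and what the compiler actually emits is an assumption about typical x86 code generation, not a guarantee: the first is expected to become a single locked instruction, the second a load followed by a cmpxchg retry loop.

```c
#include <stdatomic.h>

/* The simple form: one unconditional atomic read-modify-write. */
static void inc_simple(atomic_uint *v)
{
	atomic_fetch_add_explicit(v, 1, memory_order_seq_cst);
}

/* The ld/cmpxchg form: load, compute, then try to swap, retrying on failure. */
static void inc_cmpxchg(atomic_uint *v)
{
	unsigned int old = atomic_load_explicit(v, memory_order_relaxed);

	while (!atomic_compare_exchange_weak_explicit(v, &old, old + 1,
			memory_order_seq_cst, memory_order_relaxed))
		;	/* old now holds the current value; try again */
}
```

Both leave the counter in the same state. The difference is only in how many instructions, branches and cacheline transitions it takes to get there.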

The other one depends on cache coherency: the "lock inc" can just get the
cacheline for exclusive use immediately ("read with intent to write"). In
contrast, the ld/cmpxchg first gets the cacheline for reading, and then
has to turn it into an exclusive one. IOW, there may literally be lots of
extra bus traffic from doing a load first.

In other words, there are several advantages to just using the simple
instructions.

(Of course, some CPUs have "get cacheline for write" instructions, so you
can then make the second sequence even longer by using that.)

Using "xadd" should be fine, although for all I know, even then
microarchitectural issues may make it cheaper to use the simpler "lock
add" whenever possible.
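
As a rough illustration of the xadd-versus-add distinction, here is a sketch in C11 atomics with hypothetical function names. The assumption (not a guarantee) is that an x86 compiler needs an xadd-style instruction only when the old value is actually used, and is free to pick the simpler "lock add" when the result is thrown away.

```c
#include <stdatomic.h>

/* Result discarded: the old value is never needed, so a plain locked add suffices. */
static void counter_add(atomic_ulong *c, unsigned long n)
{
	(void)atomic_fetch_add(c, n);
}

/* Result used: the caller needs the value from before the add, xadd-style. */
static unsigned long ticket_take(atomic_ulong *next)
{
	return atomic_fetch_add(next, 1);
}
```

So "whenever possible" in practice means: whenever nobody looks at the value that was there before.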

In LL/SC, I _think_ LL generally does its read with intent to write.

Linus