Re: [PATCH] Document Linux's memory barriers

From: Linus Torvalds
Date: Thu Mar 09 2006 - 11:30:36 EST

On Thu, 9 Mar 2006, David Howells wrote:
>
> So, you're saying that the LOCK and UNLOCK primitives don't actually modify
> memory, but rather simply pin the cacheline into the CPU's cache and refuse to
> let anyone else touch it?
>
> No... it can't work like that. It *must* make a memory modification - after
> all, the CPU doesn't know that what it's doing is a spin_unlock(), say, rather
> than an atomic_set().

Basically, as long as nobody else is reading the lock, the lock will stay
in the caches.

Only old and stupid architectures go out to the bus for locking. For
example, I remember the original Alpha "load-locked"/"store-conditional",
and it was totally _horrible_ for anything that wanted performance,
because it would do the "pending lock" bit on the bus, so it took hundreds
of cycles even on UP. Gods, how I hated that. It made it almost totally
useless for anything that just wanted to be irq-safe - it was cheaper to
just disable interrupts, iirc. STUPID.

All modern CPUs do atomic operations entirely within the cache coherency
logic. I think x86 still supports the notion of a "locked cycle" on the
bus, but I think that's entirely relegated to horrible people doing locked
operations across PCI, and quite frankly, I suspect that it doesn't
actually mean a thing (ie I'd expect no external hardware to actually
react to the lock signal). However, nobody really cares, since nobody
would be crazy enough to do locked cycles over PCI even if they were to
work.
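
On x86 all of this hides behind a "lock" prefix on the read-modify-write
instruction, and on a modern core that prefix is honored by the cache
coherency logic, not by locking the bus. A sketch in C with gcc-style
inline asm (illustration only, not the kernel's actual atomic code):

	/*
	 * Sketch only: atomically increment *v. The core grabs the
	 * cacheline exclusive, does the increment, and keeps the
	 * line. No bus lock involved on anything modern.
	 */
	static inline void sketch_atomic_inc(int *v)
	{
		asm volatile("lock; incl %0" : "+m" (*v));
	}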

So in practice, as far as I know, _all_ modern CPUs do locked cycles by
getting exclusive ownership of the cacheline on the read, and then either
having logic in place to refuse to release the cacheline until the write
is complete (ie "locked cycles to the cache"), or re-trying the
instruction if the cacheline has been released by the time the write is
ready (ie "load-locked" + "store-conditional" + "potentially loop" to the
cache).

NOBODY goes out to the bus for locking any more. That would be insane and
stupid.

Yes, many spinlocks see contention, and end up going out to the bus. But
similarly, many spinlocks do _not_ see any contention at all (or other
CPUs even looking at them), and may end up staying exclusive in a CPU
cache for a long time.
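
That's also why spinlocks spin on a plain read and only do the atomic
op when the lock looks free ("test and test-and-set"). A sketch with
the generic gcc builtins, not the real x86 spinlock code:

	/*
	 * Sketch only. The xchg needs the cacheline exclusive, but
	 * the inner loop is read-only and spins in the local cache.
	 * If no other CPU ever touches the lock, the line just stays
	 * exclusive in this CPU's cache and lock+unlock generate no
	 * bus traffic at all.
	 */
	static inline void sketch_spin_lock(volatile int *lock)
	{
		while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE))
			while (*lock)
				;	/* read-only spin */
	}

	static inline void sketch_spin_unlock(volatile int *lock)
	{
		__atomic_store_n(lock, 0, __ATOMIC_RELEASE);
	}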

The "no contention" case is actually pretty important. Many real loads on
SMP end up being largely single-threaded, and together with some basic CPU
affinity, you really _really_ want to make that single-threaded case go as
fast as possible. And a pretty big part of that is locking: the difference
between a lock that goes to the bus and one that does not is _huge_.
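
You can get a feel for the uncontended cost from user space easily
enough. A crude harness (sketch, nothing kernel-specific - the absolute
numbers vary wildly by CPU, the point is just that a lock that never
leaves the cache is cheap):

	#include <stdio.h>
	#include <time.h>

	static volatile int lock;

	int main(void)
	{
		struct timespec t0, t1;
		long i, n = 100000000;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < n; i++) {
			/* uncontended: the lock line stays in our cache */
			while (__atomic_exchange_n(&lock, 1, __ATOMIC_ACQUIRE))
				;
			__atomic_store_n(&lock, 0, __ATOMIC_RELEASE);
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("%.1f ns per lock+unlock\n",
		       ((t1.tv_sec - t0.tv_sec) * 1e9 +
			(t1.tv_nsec - t0.tv_nsec)) / n);
		return 0;
	}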

And lots of trivial code is almost dominated by locking costs. In some
system calls on an SMP kernel, the locking cost can be (depending on how
good or bad the CPU is at them) quite noticeable. Just a simple small
read() will take several locks and/or do atomic ops, even if the data
was already in the page cache and the whole thing looks "trivial".

Linus