Re: [RFC][PATCH 0/5] Signal scalability series

From: Linus Torvalds
Date: Mon Oct 03 2011 - 17:45:49 EST

On Mon, Oct 3, 2011 at 1:58 PM, Matt Fleming <matt@xxxxxxxxxxxxxxxxx> wrote:
> No, I don't think there was anything wrong with your testing method. I
> ran your command-line under Qemu and saw similar results - with the
> patches applied the single-threaded case slows down (not by 50%, it
> looks more like 25%, but that's still unacceptable and not at all what I
> had anticipated).

Splitting up locks can quite easily cause these kinds of problems.

On many modern microarchitectures, the serialization implied by
locking can be a *big* performance hit. If a system call goes from a
single big lock to two split locks, that can easily make that system
call very noticeably slower. The individual locks may protect a much
smaller section and be "more scalable", but the end result is actually
clearly worse performance.
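
To make the single-threaded cost concrete, here is a minimal userspace
sketch. It uses pthread mutexes as a stand-in for kernel spinlocks, and
every name in it (struct state, update_coarse, update_split,
count_lock_ops) is made up for illustration; none of it comes from the
patch series. It only counts lock acquisitions per call. It does not
measure time, and the real cost of each acquisition depends on the
microarchitecture.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical state touched by one "system call". */
struct state {
    pthread_mutex_t big;     /* coarse lock covering both fields */
    pthread_mutex_t a_lock;  /* split lock for field a only */
    pthread_mutex_t b_lock;  /* split lock for field b only */
    long a, b;
};

static void state_init(struct state *s)
{
    pthread_mutex_init(&s->big, NULL);
    pthread_mutex_init(&s->a_lock, NULL);
    pthread_mutex_init(&s->b_lock, NULL);
    s->a = s->b = 0;
}

static void state_destroy(struct state *s)
{
    pthread_mutex_destroy(&s->big);
    pthread_mutex_destroy(&s->a_lock);
    pthread_mutex_destroy(&s->b_lock);
}

/* One big lock: a single acquire/release pair per call. */
static int update_coarse(struct state *s)
{
    pthread_mutex_lock(&s->big);
    s->a++;
    s->b++;
    pthread_mutex_unlock(&s->big);
    return 1;  /* lock acquisitions performed */
}

/* Two split locks taken in sequence: two acquire/release pairs. */
static int update_split(struct state *s)
{
    pthread_mutex_lock(&s->a_lock);
    s->a++;
    pthread_mutex_unlock(&s->a_lock);

    pthread_mutex_lock(&s->b_lock);
    s->b++;
    pthread_mutex_unlock(&s->b_lock);
    return 2;  /* lock acquisitions performed */
}

/* Run `calls` updates and return the total number of lock acquisitions. */
static long count_lock_ops(long calls, int split)
{
    struct state s;
    long ops = 0;

    state_init(&s);
    for (long i = 0; i < calls; i++)
        ops += split ? update_split(&s) : update_coarse(&s);
    assert(s.a == calls && s.b == calls);  /* same work either way */
    state_destroy(&s);
    return ops;
}
```

Both variants do the same work on a and b, but the split version pays
for twice as many serializing operations per call, with no contention
to win anything back in the single-threaded case.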

We've hit that several times when we've made smaller locks (in the VM
in particular). One big lock that you take once can be way better than
two small ones that you have to take in sequence (or, worse still,
nested - that's when you can *really* get into exponential badness).

And with even a very limited number of threads (or processes passing
signals back-and-forth) you can get a "train effect": two cores
accessing the same two locks in order, so that they get synchronized.
The "get synchronized" event itself might even be rare, but once it
happens, things can stay synchronized.

And if the second one always then ends up blocking and/or just causing
cacheline ping-pongs, that slowdown can go up by an absolutely huge
amount because you basically make the "rare" case be the common one.
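
The access pattern behind that can be sketched in userspace like this.
Again this is only an illustration under assumptions: pthread mutexes
instead of kernel locks, and hypothetical names (struct shared, worker,
run_pair) that are not from the patches. It shows two threads taking the
same two locks in the same order. It does not detect or measure the
train effect; whether the threads actually fall into lockstep depends on
timing and on the hardware.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared state guarded by two split locks. */
struct shared {
    pthread_mutex_t lock_a, lock_b;
    long a, b;
    long iters;
};

/* Each thread takes lock_a, then lock_b, over and over, in that order. */
static void *worker(void *arg)
{
    struct shared *sh = arg;

    for (long i = 0; i < sh->iters; i++) {
        pthread_mutex_lock(&sh->lock_a);
        sh->a++;
        pthread_mutex_unlock(&sh->lock_a);

        /*
         * A thread that just released lock_a tends to arrive here
         * right behind the other thread, so it can end up queued
         * on lock_b every single time around the loop.
         */
        pthread_mutex_lock(&sh->lock_b);
        sh->b++;
        pthread_mutex_unlock(&sh->lock_b);
    }
    return NULL;
}

/*
 * Run two workers for `iters` iterations each.
 * Returns the total number of increments (a + b), or -1 if a thread
 * could not be created.
 */
static long run_pair(long iters)
{
    struct shared sh = { .a = 0, .b = 0, .iters = iters };
    pthread_t t[2];
    long total;

    pthread_mutex_init(&sh.lock_a, NULL);
    pthread_mutex_init(&sh.lock_b, NULL);

    if (pthread_create(&t[0], NULL, worker, &sh) != 0)
        return -1;
    if (pthread_create(&t[1], NULL, worker, &sh) != 0) {
        pthread_join(t[0], NULL);
        return -1;
    }
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);

    total = sh.a + sh.b;
    pthread_mutex_destroy(&sh.lock_a);
    pthread_mutex_destroy(&sh.lock_b);
    return total;
}
```

Timing run_pair() against a variant that bumps both counters under a
single lock is one way to see whether the split version has fallen into
the bad pattern on a given machine.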
