Re: LMbench 2.4.0-test10pre-SMP vs. 2.2.18pre-SMP

From: Linus Torvalds (torvalds@transmeta.com)
Date: Tue Oct 24 2000 - 12:21:52 EST


In article <Pine.LNX.4.21.0010232338440.1762-100000@ferret.lmh.ox.ac.uk>,
Chris Evans <chris@scary.beasts.org> wrote:
>
>On Mon, 23 Oct 2000, Jeff Garzik wrote:
>
>> First test was with 2.4.0-test10-pre3.
>> Next four tests were with 2.4.0-test10-pre4.
>> Final four tests were with 2.2.18-pre17.
>>
>> All are 'virgin' kernels, without any patches.
>
>[...]
>
>I'll take the liberty of highlighting some big changes, v2.2 vs v2.4
>
>*Local* Communication latencies in microseconds - smaller is better
>-------------------------------------------------------------------
>Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
> ctxsw UNIX UDP TCP conn
>--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
>rum.normn Linux 2.4.0-t 6 20 45 63 106 81 157 146
>rum.normn Linux 2.2.18p 2 12 18 56 123 106 159 237
>
>- So we broke pipe/AF UNIX latencies

Not really.

The issue is that under 2.4.x, the Linux scheduler very actively tries
to spread the load out across multiple CPUs. It was a hard decision to
make, exactly because it tends to make the lmbench context switch
numbers higher - the way lmbench tests context switching is to pass a
simple token back and forth between two processes over pipes, and it's
entirely synchronous. So you get the best lmbench numbers when the
receiver and the sender stay on the same CPU (mainly for cache
reasons: everything stays in that CPU's cache, no spinlock cache lines
bouncing around, and no pipe data moving from one CPU to another).
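
Just to make it concrete, the ping-pong that lmbench is timing looks
roughly like this. This is not the actual lat_ctx source, just a
throwaway sketch in the same spirit, with a made-up iteration count:

#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

#define ROUNDS 100000

int main(void)
{
        int p2c[2], c2p[2];     /* parent->child and child->parent pipes */
        char token = 'x';
        struct timeval start, end;
        int i;

        if (pipe(p2c) < 0 || pipe(c2p) < 0) {
                perror("pipe");
                return 1;
        }

        if (fork() == 0) {      /* child just echoes the token back */
                for (i = 0; i < ROUNDS; i++) {
                        if (read(p2c[0], &token, 1) != 1)
                                _exit(1);
                        if (write(c2p[1], &token, 1) != 1)
                                _exit(1);
                }
                _exit(0);
        }

        gettimeofday(&start, NULL);
        for (i = 0; i < ROUNDS; i++) {  /* parent sends, then waits */
                write(p2c[1], &token, 1);
                read(c2p[0], &token, 1);
        }
        gettimeofday(&end, NULL);
        wait(NULL);

        /* each round trip is two context switches */
        printf("%.2f usec per switch\n",
               ((end.tv_sec - start.tv_sec) * 1e6 +
                (end.tv_usec - start.tv_usec)) / (2.0 * ROUNDS));
        return 0;
}

Every iteration blocks until the other side has run, so nothing ever
overlaps - which is exactly why keeping both processes on one CPU
looks so good here.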

This was something people (mainly Ingo) were very conscious of when
doing the scheduler changes: lmbench was something we all ran all the
time for exactly this, and it was not pretty to see the numbers go up
when the processes got scheduled on different CPUs.

But in real life, the advantage of spreading out is actually noticeable.
In real life the data passing doesn't tend to be quite as synchronous:
the pipe writer keeps generating data while the pipe reader actually
_does_ something with it, and spreading out over multiple CPUs means
that the work can be done in parallel.

Oh, well.
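
For contrast, the "real life" pattern I'm talking about looks more
like this (again just a toy I made up, not anything out of lmbench):
the writer keeps producing while the reader does some work on every
byte it gets, so on SMP the two halves can actually overlap.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#define CHUNK   4096
#define CHUNKS  10000

int main(void)
{
        int fd[2];
        static char buf[CHUNK];
        int i;

        if (pipe(fd) < 0) {
                perror("pipe");
                return 1;
        }

        if (fork() == 0) {              /* reader: actually _use_ the data */
                unsigned long sum = 0;
                ssize_t n;
                int j;

                close(fd[1]);
                while ((n = read(fd[0], buf, sizeof(buf))) > 0)
                        for (j = 0; j < n; j++)
                                sum += buf[j];  /* stand-in for real work */
                printf("checksum %lu\n", sum);
                _exit(0);
        }

        close(fd[0]);
        memset(buf, 1, sizeof(buf));    /* writer: just keeps producing */
        for (i = 0; i < CHUNKS; i++)
                write(fd[1], buf, sizeof(buf));
        close(fd[1]);                   /* reader sees EOF and exits */
        wait(NULL);
        return 0;
}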

(I don't actually know why AF UNIX went up, but it might be the same
issue. The new networking code is fairly asynchronous, which _really_
improves performance because it can actually make good use of multiple
CPUs unlike the old code, but for reasons similar to the pipe latency
case that is probably bad for the AF UNIX number. I don't think
anybody has bothered looking at it that closely: a lot more work has
been put into TCP than into AF UNIX).

>File & VM system latencies in microseconds - smaller is better
>--------------------------------------------------------------
>Host OS 0K File 10K File Mmap Prot Page
> Create Delete Create Delete Latency Fault Fault
>--------- ------------- ------ ------ ------ ------ ------- ----- -----
>rum.normn Linux 2.4.0-t 15 1 28 3 1016 1 0.0K
>rum.normn Linux 2.2.18p 16 1 29 2 7658 2 0.6K
>
>- But gave steroids to mmap latencies

This is due to all the VM layer changes. That mmap latency thing is
basically due to the new page-cache stuff that made us kick ass on all
the benchmarks that 2.2.x was bad at (ie mindcraft etc).
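
For reference, the kind of loop that latency number comes out of is
roughly the following. It's a made-up toy in the spirit of lat_mmap,
not the real thing, and the file name and sizes are arbitrary:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/time.h>

#define SIZE    (1024 * 1024)
#define ITERS   1000

int main(void)
{
        int fd = open("/tmp/mmap.test", O_RDWR | O_CREAT | O_TRUNC, 0600);
        struct timeval start, end;
        int i;

        if (fd < 0 || ftruncate(fd, SIZE) < 0) {
                perror("setup");
                return 1;
        }

        gettimeofday(&start, NULL);
        for (i = 0; i < ITERS; i++) {
                char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                p[0] = 1;       /* touch it so the mapping is for real */
                munmap(p, SIZE);
        }
        gettimeofday(&end, NULL);

        printf("%.1f usec per mmap+munmap\n",
               ((end.tv_sec - start.tv_sec) * 1e6 +
                (end.tv_usec - start.tv_usec)) / (double) ITERS);
        unlink("/tmp/mmap.test");
        return 0;
}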

>*Local* Communication bandwidths in MB/s - bigger is better
>-----------------------------------------------------------
>Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
> UNIX reread reread (libc) (hand) read write
>--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
>rum.normn Linux 2.4.0-t 152 105 98 151 326 138 144 326 171
>rum.normn Linux 2.2.18p 264 106 55 152 326 137 142 326 180
>
>- Mixed fortunes here. A serious boost to TCP bandwidth but pipe bandwidth
>dies a bit

The pipe bandwidth is intimately related to pipe latency. Linux pipes
are fairly small (only 4kB worth of data buffer), so they need good
latency to get good bandwidth. Many of the same arguments apply (ie
the bandwidth is probably better on 2.4.x if you actually _use_ the
data, as opposed to just benchmarking with it).

The pipe bandwidth could be fairly easily improved by just doubling the
buffer size (or by using VM tricks), but it's not been something that
anybody has felt was all that important in real life.
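
For what it's worth, a bandwidth test in the style of bw_pipe is just
a writer shoving big buffers down a pipe and a reader draining them -
something like the sketch below (my own toy again, not the lmbench
source). With only 4kB of kernel buffer, every big write degenerates
into a series of small copies, each with its own wakeup, which is why
the latency and the bandwidth numbers track each other so closely:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

#define BUF     (64 * 1024)
#define TOTAL   (64L * 1024 * 1024)

int main(void)
{
        int fd[2];
        static char buf[BUF];
        struct timeval start, end;
        long done;
        double sec;

        if (pipe(fd) < 0) {
                perror("pipe");
                return 1;
        }

        if (fork() == 0) {              /* reader just drains the pipe */
                close(fd[1]);
                while (read(fd[0], buf, sizeof(buf)) > 0)
                        ;
                _exit(0);
        }

        close(fd[0]);
        memset(buf, 0, sizeof(buf));
        gettimeofday(&start, NULL);
        for (done = 0; done < TOTAL; done += BUF)
                write(fd[1], buf, sizeof(buf));
        close(fd[1]);
        wait(NULL);                     /* make sure it all got consumed */
        gettimeofday(&end, NULL);

        sec = (end.tv_sec - start.tv_sec) +
              (end.tv_usec - start.tv_usec) / 1e6;
        printf("%.1f MB/s\n", TOTAL / sec / (1024 * 1024));
        return 0;
}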

Yeah, I'm probably making excuses. Some things in 2.4.x are slower
simply because it has more locks than 2.2.x. The big kernel lock in
2.2.x is actually the best for performance on many single-threaded
things, and going to more fine-grained locking has sometimes cost us.
But some of the changes (the page cache in particular) have caused
2.4.x to win BIG.

The biggest advantage of 2.4.x as far as I'm concerned is that even on a
pretty regular run-of-the-mill dual CPU machine, 2.4.x just _feels_
better under any CPU load. Much smoother behaviour, mainly because 2.2.x
sometimes holds the kernel lock long enough to make scheduling jerky.

But I bet you can find work-loads where the reverse is true also.

                        Linus