Re: [Bugme-new] [Bug 15618] New: 2.6.18->2.6.32->2.6.33 huge regression in performance

From: Lee Schermerhorn
Date: Fri Apr 02 2010 - 14:57:41 EST

On Tue, 2010-03-23 at 10:22 -0400, Andrew Morton wrote:
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> On Tue, 23 Mar 2010 16:13:25 GMT bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
> >
> >
> > Summary: 2.6.18->2.6.32->2.6.33 huge regression in performance
> > Product: Process Management
> > Version: 2.5
> > Kernel Version: 2.6.32
> > Platform: All
> > OS/Version: Linux
> > Tree: Mainline
> > Status: NEW
> > Severity: high
> > Priority: P1
> > Component: Other
> > AssignedTo: process_other@xxxxxxxxxxxxxxxxxxxx
> > ReportedBy: ant.starikov@xxxxxxxxx
> > Regression: No
> >
> >
> > We have benchmarked some multithreaded code here on 16-core/4-way opteron 8356
> > host on number of kernels (see below) and found strange results.
> > Up to 8 threads we didn't see any noticeable differences in performance, but
> > starting from 9 threads performance diverges substantially. I provide here
> > results for 14 threads
> lolz. Catastrophic meltdown. Thanks for doing all that work - at a
> guess I'd say it's mmap_sem. Perhaps with some assist from the CPU
> scheduler.
> If you change the config to set CONFIG_RWSEM_GENERIC_SPINLOCK=n,
> Anyway, there's a testcase in bugzilla and it looks like we got us some
> work to do.

I had an "opportunity" to investigate page fault behavior on 2.6.18+
[RHEL5.4] on an 8-socket Istanbul system earlier this year. When I saw
this mail, I collected up the data I had from that adventure and ran
additional tests on 2.6.33 and 2.6.34-rc1. I have attached plots of
"per node" and "system wide" page fault scalability.

The per node plot [#1] shows the page fault rate of 1 to 6
[nr_cores_per_socket] tasks [processes] and threads faulting in a fixed
GB/task at the same time on a single socket. The system wide plot [#3]
shows 1 to 48 [nr_sockets * nr_cores_per_socket] tasks and threads again
faulting in a fixed GB/task... For the latter test, I load one core
per socket at a time, then add the 2nd core per socket, ... In all
cases, the individual tasks/threads are fork()ed/pthread_create()d by a
parent bound to the cpu where they'll run to obtain node-local kernel
data structures. The tests run with SCHED_FIFO.

I plot both "faults per wall clock second"--the aggregate rate--and
"faults per cpu second" or normalized rate. The per node scalability
doesn't look all that different across the 3 releases, especially the
faults per cpu second curves. However, in the system wide
multi-threaded tests, 2.6.33 is an anomaly compared to both 2.6.18+ and
2.6.34-rc1. The 2.6.18+ and 2.6.34-rc1 multi-threaded tests show a lot
of noise and, of course, a much lower fault rate relative to the
multi-task tests. I aborted the 2.6.33 system wide multi-threaded test
at 32 threads because it was just taking too long.

Unfortunately, with this many curves, the legends obscure much of the
plot. So, rather than bloat this message any more, I've packaged up the
raw data along with plots with and without legends and placed the
tarball here:

That directory also contains the source for the version of the pft test
used, along with the scripts used to run the tests and plot the results.
Note that some manual editing of the "plot annotations" in the raw data
was required to generate several different plots from the same data.

The pft test is a highly, uh, "evolved" version of pft.c that Christoph
Lameter pointed me at a few years ago. This version requires a patched
libnuma with the v2 api. The required patch to the numactl-2.0.3
package is included in the test tarball. [I've contacted Cliff about
getting the patch into 2.0.4.]


Attachment: 1-pft-istanbul_per_node_task_vs_thread_18v33v34rc1.png
Description: PNG image

Attachment: 3-pft-istanbul_task_and_thread_18v33v34rc1.png
Description: PNG image