Re: [PATCH v8 0/9] rwsem performance optimizations
From: Tim Chen
Date: Mon Nov 04 2013 - 17:37:13 EST
Ingo,
Sorry for the late response. My old 4 socket Westmere
test machine went down and I had to find a new one,
which is a 4 socket Ivybridge machine with 15 cores per socket.
I've updated the workload as a perf benchmark (see the attached
patch). The workload will mmap, then access memory
in the mmapped area and then unmap, doing so repeatedly
for a specified time. Each thread is pinned to a
particular core, with the threads distributed evenly between
the sockets. The throughput is reported with standard deviation
info.
First, a baseline comparing the workload with serialized mmap vs
without serialized mmap, both running under the vanilla kernel.
Threads    Throughput           std dev(%)
           serial vs non-serial
           mmap(%)
1            0.10               0.16
2            0.78               0.09
3           -5.00               0.12
4           -3.27               0.08
5           -0.11               0.09
10           5.32               0.10
20          -2.05               0.05
40          -9.75               0.15
60          11.69               0.05
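The two columns in these tables can be read as a relative throughput
change and a relative standard deviation. The sketch below shows one
way to compute them; it is an assumption about how the numbers are
derived (in particular that "std dev(%)" is the sample standard
deviation as a percentage of the mean), not taken from the benchmark
patch. The helper names are hypothetical, and a small Newton iteration
stands in for sqrt() from <math.h> only so the snippet builds without
linking libm.

```c
#include <stddef.h>

/* Percent change of `patched` relative to `base`. */
static double pct_change(double base, double patched)
{
    return (patched - base) / base * 100.0;
}

/* Square root by Newton iteration (normally just use sqrt()). */
static double sqrt_newton(double v)
{
    double r;

    if (v <= 0.0)
        return 0.0;
    r = v > 1.0 ? v : 1.0;
    for (int i = 0; i < 60; i++)
        r = 0.5 * (r + v / r);
    return r;
}

/*
 * Sample standard deviation of x[0..n-1], expressed as a percentage
 * of the mean. Returns 0 when fewer than two samples are given.
 */
static double rel_stddev_pct(const double *x, size_t n)
{
    double mean = 0.0, ss = 0.0;

    if (n < 2)
        return 0.0;
    for (size_t i = 0; i < n; i++)
        mean += x[i];
    mean /= (double)n;
    for (size_t i = 0; i < n; i++)
        ss += (x[i] - mean) * (x[i] - mean);
    return sqrt_newton(ss / (double)(n - 1)) / mean * 100.0;
}
```

Under that reading, a throughput delta is only meaningful when it is
clearly larger than the std dev(%) of the runs being compared.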
Here's the data for the complete rwsem patchset vs the plain vanilla
kernel case. Overall there's an improvement, except for the 3 thread
case; the 4 thread case is flat.
Threads    Throughput           std dev(%)
           vs vanilla(%)
1            0.62               0.11
2            3.86               0.10
3           -7.02               0.19
4           -0.01               0.13
5            2.74               0.06
10           5.66               0.03
20           1.44               0.09
40           5.54               0.09
60          15.63               0.13
Now testing with both the patched kernel and the vanilla kernel
running serialized mmap, with mutex acquisition in user space.
Threads    Throughput           std dev(%)
           vs vanilla(%)
1            0.60               0.02
2            6.40               0.11
3           14.13               0.07
4           -2.41               0.07
5            1.05               0.08
10           4.15               0.05
20          -0.26               0.06
40          -3.45               0.13
60          -4.33               0.07
Here's another run with the rwsem patchset without
optimistic spinning, again compared with the vanilla kernel.
Threads    Throughput           std dev(%)
           vs vanilla(%)
1            0.81               0.04
2            2.85               0.17
3           -4.09               0.05
4           -8.31               0.07
5           -3.19               0.03
10           1.02               0.05
20          -4.77               0.04
40          -3.11               0.10
60           2.06               0.10
No-optspin case, comparing the serialized mmap workload under the
patched kernel vs the vanilla kernel.
Threads    Throughput           std dev(%)
           vs vanilla(%)
1            0.57               0.03
2            2.13               0.17
3           14.78               0.33
4           -1.23               0.11
5            2.99               0.08
10          -0.43               0.10
20           0.01               0.03
40           3.03               0.10
60          -1.74               0.09
The data is a bit of a mixed bag. I'll spin off
the MCS cleanup patch separately so we can merge that first
for Waiman's qrwlock work.
Tim
---