Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

From: Mel Gorman
Date: Wed Oct 14 2009 - 09:36:18 EST


On Tue, Oct 13, 2009 at 03:53:00PM -0700, David Rientjes wrote:
> On Tue, 13 Oct 2009, Christoph Lameter wrote:
>
> > > For an optimized fastpath, I'd expect such a workload would result in at
> > > least a slightly higher transfer rate.
> >
> > There will be no improvements if the load is dominated by the
> > instructions in the network layer or caching issues. None of that is
> > changed by the patch. It only reduces the cycle count in the fastpath.
> >
>
> Right, but CONFIG_SLAB shows a 5-6% improvement over CONFIG_SLUB in the
> same workload so it shows that the slab allocator does have an impact in
> transfer rate. I understand that the performance gain with this patchset,
> however, may not be representative with the benchmark since it also
> frequently uses the slowpath for kmalloc-256 about 25% of the time and the
> added code of the irqless patch may mask the fastpath gain.
>

I have somewhat more detailed results, based on the following machine:

CPU type: AMD Phenom 9950
CPU counts: 1 CPU (4 cores)
CPU Speed: 1.3GHz
Motherboard: Gigabyte GA-MA78GM-S2H
Memory: 8GB

The reference kernel used is mmotm-2009-10-09-01-07. The patches applied
are the patches in this thread. The headings are a bit munged but it's

SLUB-vanilla    where vanilla is mmotm-2009-10-09-01-07
SLUB-this-cpu   mmotm-2009-10-09-01-07 + patches in this thread
SLAB-*          same as above but SLAB configured instead of SLUB.
                I know it wasn't necessary to run SLAB-this-cpu but
                it gives an idea to what degree results can vary
                between reboots even if results are stable once the
                machine is running.

The benchmarks run were kernbench, netperf UDP_STREAM and TCP_STREAM and
sysbench with postgres.

Kernbench is 5 kernel compiles and an average taken. One kernel compile
is done at the start to warm the benchmark up and this result is
discarded.
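
As a rough illustration of that procedure, the sketch below drops the
warm-up run and summarises the rest. The readings are made up for the
example, and the use of the population standard deviation is an assumption
on my part rather than something stated here.

```python
from statistics import mean, pstdev

def summarise(runs, warmup=1):
    """Drop the warm-up run(s) and summarise the remaining timings."""
    kept = runs[warmup:]
    return {
        "min": min(kept),
        "mean": mean(kept),
        "stddev": pstdev(kept),  # assumption: population stddev
        "max": max(kept),
    }

# Hypothetical CPU% readings: one warm-up compile, then four measured ones.
cpu_runs = [380, 386, 386, 386, 387]
summary = summarise(cpu_runs)
print({name: round(value, 2) for name, value in summary.items()})
```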

Netperf is the _STREAM tests as opposed to the _RR tests reported
elsewhere. No special effort is done to bind processes to any particular
CPU. Netperf was asked to be 99% confident that the estimated mean was
within 1% of the true mean. Results where netperf failed to achieve the
necessary confidence are marked with a * and the line after such a result
states how close, as a percentage, the estimated mean is to the true mean
at that confidence level. The test is run with different packet sizes.
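
To make the confidence criterion concrete, here is a minimal sketch of the
check. It is not netperf's own code: it uses a normal approximation for the
interval, the function names are hypothetical and the sample throughputs
are invented.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def interval_width_percent(samples, confidence=0.99):
    """Half-width of the confidence interval as a percentage of the mean."""
    # Assumption: normal approximation; a real tool may use Student's t.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = stdev(samples) / sqrt(len(samples))
    return 100.0 * z * standard_error / mean(samples)

def needs_star(samples, target_percent=1.0, confidence=0.99):
    """True if the result would be flagged as not meeting the target."""
    return interval_width_percent(samples, confidence) > target_percent

# Hypothetical throughput samples from repeated iterations of one test.
throughput = [100.0, 101.0, 99.0, 100.0, 100.0]
print(round(interval_width_percent(throughput), 2), needs_star(throughput))
```

A result that fails the check is the kind that gets a * in the tables, with
the achieved width reported on the following line.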

Sysbench is a read-only test (to avoid IO) and is the "complex"
workload. The test is run with varying numbers of threads.

In all the results, SLUB-vanilla is the reference baseline. This allows
a comparison between SLUB-vanilla and SLAB-vanilla as well as with the
patches applied.
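
For anyone wanting to reproduce the bracketed percentages, the sketch below
shows formulas that appear consistent with the figures in the tables. They
are inferred from the numbers, not taken from the reporting scripts, and
the function names are hypothetical. A positive value is a gain over the
baseline in both cases.

```python
def time_gain(baseline, value):
    """Gain for metrics where lower is better (kernbench times)."""
    return 100.0 * (baseline - value) / baseline

def throughput_gain(baseline, value):
    """Gain for metrics where higher is better (netperf, sysbench).

    Assumption: the tables seem to divide by the compared value rather
    than by the baseline.
    """
    return 100.0 * (value - baseline) / value

# System min, SLUB-vanilla vs this-cpu: the table reports ( 1.25%)
print(round(time_gain(35.95, 35.50), 2))
# UDP_STREAM 256 bytes, SLUB-vanilla vs this-cpu: the table reports ( 4.19%)
print(round(throughput_gain(583.63, 609.14), 2))
```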

KERNBENCH
                  SLUB-vanilla         this-cpu             SLAB-vanilla         this-cpu
Elapsed min       92.95 ( 0.00%)       92.62 ( 0.36%)       92.93 ( 0.02%)       92.62 ( 0.36%)
Elapsed mean      93.11 ( 0.00%)       92.74 ( 0.40%)       93.00 ( 0.13%)       92.82 ( 0.32%)
Elapsed stddev     0.10 ( 0.00%)        0.14 (-40.55%)       0.04 (55.47%)        0.18 (-84.33%)
Elapsed max       93.20 ( 0.00%)       92.95 ( 0.27%)       93.05 ( 0.16%)       93.09 ( 0.12%)
User min         323.21 ( 0.00%)      322.60 ( 0.19%)      322.50 ( 0.22%)      323.26 (-0.02%)
User mean        323.81 ( 0.00%)      323.20 ( 0.19%)      323.16 ( 0.20%)      323.54 ( 0.08%)
User stddev        0.40 ( 0.00%)        0.46 (-15.30%)       0.48 (-20.92%)       0.29 (26.07%)
User max         324.32 ( 0.00%)      323.72 ( 0.19%)      323.86 ( 0.14%)      323.98 ( 0.10%)
System min        35.95 ( 0.00%)       35.50 ( 1.25%)       35.35 ( 1.67%)       36.01 (-0.17%)
System mean       36.30 ( 0.00%)       35.96 ( 0.96%)       36.17 ( 0.36%)       36.23 ( 0.21%)
System stddev      0.25 ( 0.00%)        0.45 (-75.60%)       0.56 (-121.14%)      0.14 (46.14%)
System max        36.65 ( 0.00%)       36.67 (-0.05%)       36.94 (-0.79%)       36.39 ( 0.71%)
CPU min          386.00 ( 0.00%)      386.00 ( 0.00%)      386.00 ( 0.00%)      386.00 ( 0.00%)
CPU mean         386.25 ( 0.00%)      386.75 (-0.13%)      386.00 ( 0.06%)      387.25 (-0.26%)
CPU stddev         0.43 ( 0.00%)        0.83 (-91.49%)       0.00 (100.00%)       0.83 (-91.49%)
CPU max          387.00 ( 0.00%)      388.00 (-0.26%)      386.00 ( 0.26%)      388.00 (-0.26%)

There are small gains in the User, System and Elapsed times with the
this-cpu patches applied. It is interesting to note for the mean times that
the patches more than close the gap between SLUB and SLAB for the most part -
the exception being User time, where SLAB-vanilla remains marginally better
(323.16 vs 323.20). This might indicate that SLAB is still slightly better
at giving back cache-hot memory but this is speculation.

NETPERF UDP_STREAM
Packet     netperf-udp            udp-SLUB               netperf-udp            udp-SLAB
Size       SLUB-vanilla           this-cpu               SLAB-vanilla           this-cpu
64         148.48 ( 0.00%)        152.03 ( 2.34%)        147.45 (-0.70%)        150.07 ( 1.06%)
128        294.65 ( 0.00%)        299.92 ( 1.76%)        289.20 (-1.88%)        290.15 (-1.55%)
256        583.63 ( 0.00%)        609.14 ( 4.19%)        590.78 ( 1.21%)        586.42 ( 0.48%)
1024      2217.90 ( 0.00%)       2261.99 ( 1.95%)       2219.64 ( 0.08%)       2207.93 (-0.45%)
2048      4164.27 ( 0.00%)       4161.47 (-0.07%)       4216.46 ( 1.24%)       4155.11 (-0.22%)
3312      6284.17 ( 0.00%)       6383.24 ( 1.55%)       6231.88 (-0.84%)       6243.82 (-0.65%)
4096      7399.42 ( 0.00%)       7686.38 ( 3.73%)       7394.89 (-0.06%)       7487.91 ( 1.18%)
6144     10014.35 ( 0.00%)      10199.48 ( 1.82%)       9927.92 (-0.87%)*     10067.40 ( 0.53%)
            1.00%                  1.00%                  1.08%                  1.00%
8192     11232.50 ( 0.00%)*     11368.13 ( 1.19%)*     12280.88 ( 8.54%)*     12244.23 ( 8.26%)
            1.65%                  1.64%                  1.32%                  1.00%
10240    12961.87 ( 0.00%)      13099.82 ( 1.05%)*     13816.33 ( 6.18%)*     13927.18 ( 6.93%)
            1.00%                  1.03%                  1.21%                  1.00%
12288    14403.74 ( 0.00%)*     14276.89 (-0.89%)*     15173.09 ( 5.07%)*     15464.05 ( 6.86%)*
            1.31%                  1.63%                  1.93%                  1.55%
14336    15229.98 ( 0.00%)*     15218.52 (-0.08%)*     16412.94 ( 7.21%)      16252.98 ( 6.29%)
            1.37%                  2.76%                  1.00%                  1.00%
16384    15367.60 ( 0.00%)*     16038.71 ( 4.18%)      16635.91 ( 7.62%)      17128.87 (10.28%)*
            1.29%                  1.00%                  1.00%                  6.36%

The patches mostly improve the performance of netperf UDP_STREAM by a good
whack so the patches are a plus here. However, it should also be noted that
SLAB was mostly faster than SLUB, particularly for large packet sizes. Refresh
my memory, how do SLUB and SLAB differ with regard to off-loading large
allocations to the page allocator these days?

NETPERF TCP_STREAM
Packet     netperf-tcp            tcp-SLUB               netperf-tcp            tcp-SLAB
Size       SLUB-vanilla           this-cpu               SLAB-vanilla           this-cpu
64        1773.00 ( 0.00%)       1731.63 (-2.39%)*      1794.48 ( 1.20%)       2029.46 (12.64%)
            1.00%                  2.43%                  1.00%                  1.00%
128       3181.12 ( 0.00%)       3471.22 ( 8.36%)       3296.37 ( 3.50%)       3251.33 ( 2.16%)
256       4794.35 ( 0.00%)       4797.38 ( 0.06%)       4912.99 ( 2.41%)       4846.86 ( 1.08%)
1024      9438.10 ( 0.00%)       8681.05 (-8.72%)*      8270.58 (-14.12%)      8268.85 (-14.14%)
            1.00%                  7.31%                  1.00%                  1.00%
2048      9196.06 ( 0.00%)       9375.72 ( 1.92%)      11474.59 (19.86%)       9420.01 ( 2.38%)
3312     10338.49 ( 0.00%)*     10021.82 (-3.16%)*     12018.72 (13.98%)*     12069.28 (14.34%)*
            9.49%                  6.36%                  1.21%                  2.12%
4096      9931.20 ( 0.00%)*     10285.38 ( 3.44%)*     12265.59 (19.03%)*     10175.33 ( 2.40%)*
            1.31%                  1.38%                  9.97%                  8.33%
6144     12775.08 ( 0.00%)*     10559.63 (-20.98%)     13139.34 ( 2.77%)      13210.79 ( 3.30%)*
            1.45%                  1.00%                  1.00%                  2.99%
8192     10933.93 ( 0.00%)*     10534.41 (-3.79%)*     10876.42 (-0.53%)*     10738.25 (-1.82%)*
           14.29%                  2.10%                 12.50%                  9.55%
10240    12868.58 ( 0.00%)      12991.65 ( 0.95%)      10892.20 (-18.14%)     13106.01 ( 1.81%)
12288    11854.97 ( 0.00%)      12122.34 ( 2.21%)*     12129.79 ( 2.27%)*     12411.84 ( 4.49%)*
            1.00%                  6.61%                  5.78%                  8.95%
14336    12552.48 ( 0.00%)*     12501.71 (-0.41%)*     12274.54 (-2.26%)      12322.63 (-1.87%)*
            6.05%                  2.58%                  1.00%                  2.23%
16384    11733.09 ( 0.00%)*     12735.05 ( 7.87%)*     13195.68 (11.08%)*     14401.62 (18.53%)
            1.14%                  9.79%                 10.30%                  1.00%

The results for the patches are a bit all over the place for TCP_STREAM
with big gains and losses depending on the packet size, particularly 6144
for some reason. SLUB vs SLAB shows SLAB often has really massive advantages
and this is not always for the larger packet sizes where the page allocator
might be a suspect.

SYSBENCH
Threads    SLUB-vanilla           this-cpu               SLAB-vanilla           this-cpu
1        26950.79 ( 0.00%)      26822.05 (-0.48%)      26919.89 (-0.11%)      26746.18 (-0.77%)
2        51555.51 ( 0.00%)      51928.02 ( 0.72%)      51370.02 (-0.36%)      51129.82 (-0.83%)
3        76204.23 ( 0.00%)      76333.58 ( 0.17%)      76483.99 ( 0.37%)      75954.52 (-0.33%)
4       100599.12 ( 0.00%)     101757.98 ( 1.14%)     100499.65 (-0.10%)     101605.61 ( 0.99%)
5       100211.45 ( 0.00%)     100435.33 ( 0.22%)     100150.98 (-0.06%)      99398.11 (-0.82%)
6        99390.81 ( 0.00%)      99840.85 ( 0.45%)      99234.38 (-0.16%)      99244.42 (-0.15%)
7        98740.56 ( 0.00%)      98727.61 (-0.01%)      98305.88 (-0.44%)      98123.56 (-0.63%)
8        98075.89 ( 0.00%)      98048.62 (-0.03%)      98183.99 ( 0.11%)      97587.82 (-0.50%)
9        96502.22 ( 0.00%)      97276.80 ( 0.80%)      96819.88 ( 0.33%)      97320.51 ( 0.84%)
10       96598.70 ( 0.00%)      96545.37 (-0.06%)      96222.51 (-0.39%)      96221.69 (-0.39%)
11       95500.66 ( 0.00%)      95671.11 ( 0.18%)      95003.21 (-0.52%)      95246.81 (-0.27%)
12       94572.87 ( 0.00%)      95266.70 ( 0.73%)      93807.60 (-0.82%)      94859.82 ( 0.30%)
13       93811.85 ( 0.00%)      94309.18 ( 0.53%)      93219.81 (-0.64%)      93051.63 (-0.82%)
14       92972.16 ( 0.00%)      93849.87 ( 0.94%)      92641.50 (-0.36%)      92916.70 (-0.06%)
15       92276.06 ( 0.00%)      92454.94 ( 0.19%)      91094.04 (-1.30%)      91972.79 (-0.33%)
16       90265.35 ( 0.00%)      90416.26 ( 0.17%)      89309.26 (-1.07%)      90103.89 (-0.18%)

The patches mostly gain for sysbench although the gains are very marginal
and SLUB has a minor advantage over SLAB. I haven't actually checked how
slab-intensive this workload is. The differences are so marginal that I
would guess the answer is "not very".

Overall, based on these results, I would say that the patches are a "Good Thing"
for this machine at least. With the patches applied, SLUB has a marginal
advantage over SLAB for kernbench. However, netperf TCP_STREAM and UDP_STREAM
both show significant disadvantages for SLUB and this cannot always be
explained by differing behaviour with respect to page-allocator offloading.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab