Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

From: Pekka Enberg
Date: Tue Oct 13 2009 - 15:46:38 EST

In reply to: Christoph Lameter: "Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths"
Next in thread: Christoph Lameter: "Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths"

Hi Christoph,

Christoph Lameter wrote:

Here are some cycle numbers w/o the slub patches and with. I will post the
full test results and the patches to do these in kernel tests in a new
thread. The regression may be due to caching behavior of SLUB that will
not change with these patches.

Alloc fastpath wins ~ 50%. kfree also has a 50% win if the fastpath is
being used. First test does 10000 kmallocs and then frees them all.
Second test alloc one and free one and does that 10000 times.

I wonder how reliable these numbers are. We did similar testing a while back because we thought kmalloc-96 caches had weird cache behavior, but we finally figured out that the anomaly was explained by the order in which the tests were run, not by cache size.

AFAICT, we have a similar artifact in these tests as well:

no this_cpu ops

1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 239 cycles kfree -> 261 cycles
10000 times kmalloc(16) -> 249 cycles kfree -> 208 cycles
10000 times kmalloc(32) -> 215 cycles kfree -> 232 cycles
10000 times kmalloc(64) -> 164 cycles kfree -> 216 cycles

Notice the drop from 32 to 64 and then the jump back up at 128. One would expect to see a linear increase as object size grows, since we hit the page allocator more often, no?

10000 times kmalloc(128) -> 266 cycles kfree -> 275 cycles
10000 times kmalloc(256) -> 478 cycles kfree -> 199 cycles
10000 times kmalloc(512) -> 449 cycles kfree -> 201 cycles
10000 times kmalloc(1024) -> 484 cycles kfree -> 398 cycles
10000 times kmalloc(2048) -> 475 cycles kfree -> 559 cycles
10000 times kmalloc(4096) -> 792 cycles kfree -> 506 cycles
10000 times kmalloc(8192) -> 753 cycles kfree -> 679 cycles
10000 times kmalloc(16384) -> 968 cycles kfree -> 712 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 292 cycles
10000 times kmalloc(16)/kfree -> 308 cycles
10000 times kmalloc(32)/kfree -> 326 cycles
10000 times kmalloc(64)/kfree -> 303 cycles
10000 times kmalloc(128)/kfree -> 257 cycles
10000 times kmalloc(256)/kfree -> 262 cycles
10000 times kmalloc(512)/kfree -> 293 cycles
10000 times kmalloc(1024)/kfree -> 262 cycles
10000 times kmalloc(2048)/kfree -> 289 cycles
10000 times kmalloc(4096)/kfree -> 274 cycles
10000 times kmalloc(8192)/kfree -> 265 cycles
10000 times kmalloc(16384)/kfree -> 1041 cycles

with this_cpu_xx

1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 134 cycles kfree -> 212 cycles
10000 times kmalloc(16) -> 109 cycles kfree -> 116 cycles

Same artifact here.

10000 times kmalloc(32) -> 157 cycles kfree -> 231 cycles
10000 times kmalloc(64) -> 168 cycles kfree -> 169 cycles
10000 times kmalloc(128) -> 263 cycles kfree -> 260 cycles
10000 times kmalloc(256) -> 430 cycles kfree -> 251 cycles
10000 times kmalloc(512) -> 415 cycles kfree -> 258 cycles
10000 times kmalloc(1024) -> 406 cycles kfree -> 432 cycles
10000 times kmalloc(2048) -> 457 cycles kfree -> 579 cycles
10000 times kmalloc(4096) -> 624 cycles kfree -> 553 cycles
10000 times kmalloc(8192) -> 851 cycles kfree -> 851 cycles
10000 times kmalloc(16384) -> 907 cycles kfree -> 722 cycles

And looking at these numbers:

2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 232 cycles
10000 times kmalloc(16)/kfree -> 150 cycles
10000 times kmalloc(32)/kfree -> 278 cycles
10000 times kmalloc(64)/kfree -> 263 cycles
10000 times kmalloc(128)/kfree -> 280 cycles
10000 times kmalloc(256)/kfree -> 279 cycles
10000 times kmalloc(512)/kfree -> 299 cycles
10000 times kmalloc(1024)/kfree -> 289 cycles
10000 times kmalloc(2048)/kfree -> 288 cycles
10000 times kmalloc(4096)/kfree -> 321 cycles
10000 times kmalloc(8192)/kfree -> 285 cycles
10000 times kmalloc(16384)/kfree -> 1002 cycles

If there's a 50% improvement in the kmalloc() path, why does the this_cpu() version seem to be roughly as fast as the mainline version?

Pekka