Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance
From: Mathieu Desnoyers
Date: Tue Jul 10 2007 - 04:27:22 EST
* Christoph Lameter (clameter@xxxxxxx) wrote:
> Ok here is a replacement patch for the cmpxchg patch. Problems
>
> 1. cmpxchg_local is not available on all arches. If we wanted to do
> this then it needs to be universally available.
>
cmpxchg_local is not available on all archs, but local_cmpxchg is. It
expects a local_t type which is nothing else than a long. When the local
atomic operation is not more efficient or not implemented on a given
architecture, asm-generic/local.h falls back on atomic_long_t. If you
want, you could work on the local_t type, which you could cast from a
long to a pointer when you need so, since their size are, AFAIK, always
the same (and some VM code even assume this is always the case).
> 2. cmpxchg_local does generate the "lock" prefix. It should not do that.
> Without fixes to cmpxchg_local we cannot expect maximum performance.
>
Yup, see the patch I just posted for this.
> 3. The approach is x86 centric. It relies on a cmpxchg that does not
> synchronize with memory used by other cpus and therefore is more
> lightweight. As far as I know the IA64 cmpxchg cannot do that.
> Neither several other processors. I am not sure how cmpxchgless
> platforms would use that. We need a detailed comparison of
> interrupt enable /disable vs. cmpxchg cycle counts for cachelines in
> the cpu cache to evaluate the impact that such a change would have.
>
> The cmpxchg (or its emulation) does not need any barriers since the
> accesses can only come from a single processor.
>
Yes, expected improvements goes as follow:
x86, x86_64 : must faster due to non-LOCKed cmpxchg
alpha: should be faster due to memory barrier removal
mips: memory barriers removed
powerpc 32/64: memory barriers removed
On other architectures, either there is no better implementation than
the standard atomic cmpxchg or it just has not been implemented.
I guess that a test series that would tell us how must improvement is
seen on the optimized architectures (local cmpxchg vs interrupt
enable/disable) and also what effect the standard cmpxchg has compared
to interrupt disable/enable on the architectures where we can't do
better than the standard cmpxchg will tell us if it is an interesting
way to go. I would be happy to do these tests, but I don't have the
hardware handy. I provide a test module to get these characteristics
from various architectures in this email.
> Mathieu measured a significant performance benefit coming from not using
> interrupt enable / disable.
>
> Some rough processor cycle counts (anyone have better numbers?)
>
> STI CLI CMPXCHG
> IA32 36 26 1 (assume XCHG == CMPXCHG, sti/cli also need stack pushes/pulls)
> IA64 12 12 1 (but ar.ccv needs 11 cycles to set comparator,
> need register moves to preserve processors flags)
>
The measurements I get (in cycles):
enable interrupts (STI) disable interrupts (CLI) local CMPXCHG
IA32 (P4) 112 82 26
x86_64 AMD64 125 102 19
> Looks like STI/CLI is pretty expensive and it seems that we may be able to
> optimize the alloc / free hotpath quite a bit if we could drop the
> interrupt enable / disable. But we need some measurements.
>
>
> Draft of a new patch:
>
> SLUB: Single atomic instruction alloc/free using cmpxchg_local
>
> A cmpxchg allows us to avoid disabling and enabling interrupts. The cmpxchg
> is optimal to allow operations on per cpu freelist. We can stay on one
> processor by disabling preemption() and allowing concurrent interrupts
> thus avoiding the overhead of disabling and enabling interrupts.
>
> Pro:
> - No need to disable interrupts.
> - Preempt disable /enable vanishes on non preempt kernels
> Con:
> - Slightly complexer handling.
> - Updates to atomic instructions needed
>
> Signed-off-by: Christoph Lameter <clameter@xxxxxxx>
>
Test local cmpxchg vs int disable/enable. Please run on a 2.6.22 kernel
(or recent 2.6.21-rcX-mmX) (with my cmpxchg local fix patch for x86_64).
Make sure the TSC reads (get_cycles()) are reliable on your platform.
Mathieu
/* test-cmpxchg-nolock.c
*
* Compare local cmpxchg with irq disable / enable.
*/
#include <linux/jiffies.h>
#include <linux/compiler.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/calc64.h>
#include <asm/timex.h>
#include <asm/system.h>
#define NR_LOOPS 20000
int test_val = 0;
static void do_test_cmpxchg(void)
{
int ret;
long flags;
unsigned int i;
cycles_t time1, time2, time;
long rem;
local_irq_save(flags);
preempt_disable();
time1 = get_cycles();
for (i = 0; i < NR_LOOPS; i++) {
ret = cmpxchg_local(&test_val, 0, 0);
}
time2 = get_cycles();
local_irq_restore(flags);
preempt_enable();
time = time2 - time1;
printk(KERN_ALERT "test results: time for non locked cmpxchg\n");
printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
printk(KERN_ALERT "total time: %llu\n", time);
time = div_long_long_rem(time, NR_LOOPS, &rem);
printk(KERN_ALERT "-> non locked cmpxchg takes %llu cycles\n", time);
printk(KERN_ALERT "test end\n");
}
/*
* This test will have a higher standard deviation due to incoming interrupts.
*/
static void do_test_enable_int(void)
{
long flags;
unsigned int i;
cycles_t time1, time2, time;
long rem;
local_irq_save(flags);
preempt_disable();
time1 = get_cycles();
for (i = 0; i < NR_LOOPS; i++) {
local_irq_restore(flags);
}
time2 = get_cycles();
local_irq_restore(flags);
preempt_enable();
time = time2 - time1;
printk(KERN_ALERT "test results: time for enabling interrupts (STI)\n");
printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
printk(KERN_ALERT "total time: %llu\n", time);
time = div_long_long_rem(time, NR_LOOPS, &rem);
printk(KERN_ALERT "-> enabling interrupts (STI) takes %llu cycles\n",
time);
printk(KERN_ALERT "test end\n");
}
static void do_test_disable_int(void)
{
unsigned long flags, flags2;
unsigned int i;
cycles_t time1, time2, time;
long rem;
local_irq_save(flags);
preempt_disable();
time1 = get_cycles();
for ( i = 0; i < NR_LOOPS; i++) {
local_irq_save(flags2);
}
time2 = get_cycles();
local_irq_restore(flags);
preempt_enable();
time = time2 - time1;
printk(KERN_ALERT "test results: time for disabling interrupts (CLI)\n");
printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
printk(KERN_ALERT "total time: %llu\n", time);
time = div_long_long_rem(time, NR_LOOPS, &rem);
printk(KERN_ALERT "-> disabling interrupts (CLI) takes %llu cycles\n",
time);
printk(KERN_ALERT "test end\n");
}
static int ltt_test_init(void)
{
printk(KERN_ALERT "test init\n");
do_test_cmpxchg();
do_test_enable_int();
do_test_disable_int();
return -EAGAIN; /* Fail will directly unload the module */
}
static void ltt_test_exit(void)
{
printk(KERN_ALERT "test exit\n");
}
module_init(ltt_test_init)
module_exit(ltt_test_exit)
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Mathieu Desnoyers");
MODULE_DESCRIPTION("Cmpxchg local test");
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/