Re: SMP performance degradation with sysbench

From: Nick Piggin
Date: Tue Mar 13 2007 - 08:09:23 EST

Next message: Andrew Morton: "Re: [QUICKLIST 0/4] Arch independent quicklists V2"
Previous message: Alan Cox: "Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)"
In reply to: Jakub Jelinek: "Re: SMP performance degradation with sysbench"
Next in thread: Siddha, Suresh B: "Re: SMP performance degradation with sysbench"

Andrea Arcangeli wrote:

On Tue, Mar 13, 2007 at 10:12:19PM +1100, Nick Piggin wrote:

They'll be sleeping in futex_wait in the kernel, I think. One thread
will hold the critical mutex, some will be off doing their own thing,
but importantly there will be many sleeping for the mutex to become
available.

The initial assumption was that there was zero idle time with threads
= cpus and the idle time showed up only when the number of threads
increased to double the number of cpus. If the idle time didn't
increase with the number of threads, nothing would be suspect.

Well I think more threads ~= more probability that this guy is going to
be preempted while holding the mutex?

This might be why FreeBSD works much better, because it looks like MySQL
actually will set RT scheduling for those processes that take critical
resources.

However, I tested with a bigger system and actually the idle time
comes before we saturate all CPUs. Also, increasing the aggressiveness
of the load balancer did not drop idle time at all, so it is not a case
of some runqueues idle while others have many threads on them.

It'd be interesting to see the sysrq+t after the idle time
increased.

I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
glibc allocator. But I wonder if there are other improvements that glibc
can do here?

My wild guess is that they're allocating memory after taking
futexes. If they do, something like this will happen:

taskA                   taskB                   taskC
user lock
                        mmap_sem lock
mmap_sem -> schedule
                                                user lock -> schedule

If taskB weren't there triggering more random thrashing over the
mmap_sem, the lock holder wouldn't wait and taskC wouldn't wait either.

I suspect the real fix is not to allocate memory or to run other
expensive syscalls that can block inside the futex critical sections...
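
As a rough illustration of that suggestion (not MySQL's actual code), here is
a minimal C sketch. It assumes a hypothetical `record_list` guarded by a
pthread mutex, which on Linux is backed by a futex. All names below are made
up for the example. The first add function calls malloc() while holding the
mutex, so a sleep inside the allocator (for example on mmap_sem) stalls every
thread waiting for the lock. The second does the allocation and copy before
locking and holds the mutex only for the pointer store.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared structure, invented for this sketch. */
struct record_list {
    pthread_mutex_t lock;   /* userspace lock, futex-backed on Linux */
    char **items;
    size_t count;
    size_t capacity;
};

/* Set up an empty list with a fixed capacity. Returns 0 on success. */
int record_list_init(struct record_list *l, size_t capacity)
{
    l->items = calloc(capacity, sizeof(*l->items));
    if (!l->items)
        return -1;
    l->count = 0;
    l->capacity = capacity;
    if (pthread_mutex_init(&l->lock, NULL) != 0) {
        free(l->items);
        l->items = NULL;
        return -1;
    }
    return 0;
}

/* Pattern to avoid: the allocation happens inside the critical section. */
int record_add_alloc_inside(struct record_list *l, const char *text)
{
    int ret = -1;

    pthread_mutex_lock(&l->lock);
    if (l->count < l->capacity) {
        /* malloc() may enter the kernel and sleep while we hold the lock. */
        char *copy = malloc(strlen(text) + 1);
        if (copy) {
            strcpy(copy, text);
            l->items[l->count++] = copy;
            ret = 0;
        }
    }
    pthread_mutex_unlock(&l->lock);
    return ret;
}

/* Alternative: allocate and copy first, lock only to publish the pointer. */
int record_add_alloc_outside(struct record_list *l, const char *text)
{
    size_t len = strlen(text) + 1;
    char *copy = malloc(len);
    int ret = -1;

    if (!copy)
        return -1;
    memcpy(copy, text, len);

    pthread_mutex_lock(&l->lock);
    if (l->count < l->capacity) {
        l->items[l->count++] = copy;
        ret = 0;
    }
    pthread_mutex_unlock(&l->lock);

    if (ret != 0)
        free(copy);         /* the free also runs outside the lock */
    return ret;
}
```

Both functions leave the list in the same state. The only difference is how
long the mutex is held and whether the holder can block in the allocator
while others wait.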

I would agree that it points to MySQL scalability issues; however, the
fact that such large gains come from tcmalloc is still interesting.

--
SUSE Labs, Novell Inc.
