[PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
From: Eric Dumazet
Date: Thu Mar 15 2007 - 15:11:27 EST
Hi
I'm pleased to present these patches, which improve Linux futex performance and
scalability on UP, SMP and NUMA configs.
I had this idea last year but it was not well understood, probably because I did
not give enough explanation. Sorry if this mail is really long...
Analysis of the current Linux futex code :
--------------------------------------
A central hash table, futex_queues[], holds the contexts (futex_q) of all waiting
threads.
Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to
perform lookups or insertion/deletion of a futex_q.
When a futex_wait() is performed, the thread has to :
1) Obtain a read lock on mmap_sem to be able to validate the user pointer
   (calling find_vma()). This validation tells us whether the futex uses
   an inode based store (mapped file) or an mm based store (anonymous memory).
2) Compute a hash key.
3) Atomically increment a reference counter on the inode or the mm.
4) Lock the relevant part of the futex_queues[] hash table.
5) Perform the test on the futex value.
   (rollback if value != expected_value, returning EWOULDBLOCK)
   (various loops if the test triggers mm faults)
6) Queue the context into the hash table and release the lock taken in 4).
7) Release the read lock on mmap_sem.
<block>
8) Unqueue the context if needed (but rarely, as this part
   may be done by futex_wake()); see the user-space sketch after this list.
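For context only, this is roughly what the user-space side of that path looks
like; a minimal sketch of raw futex() usage (helper names are mine, not kernel
or glibc code, error handling omitted) :

/* Minimal user-space driver of the futex_wait()/futex_wake() paths described
 * above. Illustrative only. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

static long my_futex_wait(uint32_t *uaddr, uint32_t expected)
{
	/* Blocks only if *uaddr still equals 'expected' (step 5 above),
	 * otherwise returns immediately with errno = EWOULDBLOCK. */
	return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static long my_futex_wake(uint32_t *uaddr, int nr_wake)
{
	/* Wakes up to nr_wake threads queued on this futex word. */
	return syscall(SYS_futex, uaddr, FUTEX_WAKE, nr_wake, NULL, NULL, 0);
}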
Futexes were designed to improve scalability, but the current implementation
has various problems :
- Central hash table :
  This means scalability problems when many processes/threads want to use
  futexes at the same time.
  This means NUMA imbalance, because this hash table is allocated on one node.
- Using mmap_sem on every futex() syscall :
  Even though mmap_sem is a rw_semaphore, up_read()/down_read() perform atomic
  ops on mmap_sem, dirtying its cache line :
  - lots of cache line ping-pong on SMP configurations.
  mmap_sem is also extensively used by mm code (page faults, mmap()/munmap()).
  Highly threaded processes might suffer from mmap_sem contention.
  mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
  programs because of contention on the mmap_sem cache line.
- Using atomic_inc()/atomic_dec() on the inode ref counter or mm ref counter :
  This is also a cache line ping-pong on SMP. It also increases mmap_sem hold
  time because of cache misses.
Most of these scalability problems come from the fact that futexes live in
one global namespace. As we use a central hash table, we must make sure
they all use the same reference (given by the mm subsystem).
We chose to force all futexes to be 'shared'. This has a cost.
But the fact is that POSIX defined PRIVATE and SHARED, allowing a clear
separation and optimal performance if carefully implemented. The time has come
for Linux to have better threading performance.
The PTHREAD_PROCESS_PRIVATE semantic allows the implementation to use separate
repositories (see the short pthread example after this list) :
- One 'global' namespace for all PROCESS_SHARED futexes.
- One "per process private repository" for PROCESS_PRIVATE futexes.
  This repository is NUMA aware; it is allocated the first time a process
  issues a futex(XXXX_PRIVATE) call. If the allocation is not possible because
  of memory shortage, we simply fall back to the central repository.
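For illustration only, this is how an application already selects one semantic
or the other through the standard pthread API (nothing new here, shown just to
make the PRIVATE/SHARED distinction concrete) :

/* Standard POSIX way of choosing the futex sharing semantic.
 * PTHREAD_PROCESS_PRIVATE (the default) can be served by the per-process
 * repository; PTHREAD_PROCESS_SHARED must keep using the global one. */
#include <pthread.h>

static pthread_mutex_t private_lock; /* threads of this process only */
static pthread_mutex_t shared_lock;  /* would normally live in shared memory,
                                        declared static here only for brevity */

static void init_locks(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_PRIVATE);
	pthread_mutex_init(&private_lock, &attr);

	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_mutex_init(&shared_lock, &attr);

	pthread_mutexattr_destroy(&attr);
}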
The goal is to permit new futex commands to avoid :
- Using the central hash table (still used by PTHREAD_PROCESS_SHARED futexes)
- Taking the mmap_sem semaphore, conflicting with other subsystems.
- Modifying a ref_count on mm or an inode, still conflicting with mm or fs.
This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter what the underlying mm storage is.
This is why these patches :
1) Define new futex subcommands (basically adding a _PRIVATE flag).
   This avoids using mmap_sem and the ref counter on the inode or mm.
2) Allow each process to have a private repository (a small hash table)
   where its active PROCESS_PRIVATE futexes are stored, instead of
   the global repository.
   If CONFIG_BASE_SMALL is set, we still use the global repository.
3) NUMA optimization : we allocate the global hash table with vmalloc().
If glibc wants to exploit this new infrastructure, it should use the new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes, and be
prepared to fall back to the old subcommands on old kernels. Using one
global variable holding either FUTEX_PRIVATE_FLAG or 0 should be OK (a
sketch follows). Only PTHREAD_PROCESS_SHARED futexes should use the old
subcommands.
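A possible shape for that fallback, assuming the new subcommands are encoded as
old_command | FUTEX_PRIVATE_FLAG, that an old kernel returns ENOSYS for an
unknown subcommand, and that headers defining FUTEX_PRIVATE_FLAG are available
(the helper name is mine, this is not actual glibc code) :

/* Sketch of the libc-side fallback: one global variable holds
 * FUTEX_PRIVATE_FLAG on new kernels and is cleared once on old ones. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>
#include <stdint.h>
#include <stddef.h>

static int futex_private_flag = FUTEX_PRIVATE_FLAG; /* optimistic default */

static long futex_wait_maybe_private(uint32_t *uaddr, uint32_t val)
{
	long ret = syscall(SYS_futex, uaddr, FUTEX_WAIT | futex_private_flag,
			   val, NULL, NULL, 0);

	if (ret == -1 && errno == ENOSYS && futex_private_flag) {
		/* Old kernel: forget about _PRIVATE subcommands and retry. */
		futex_private_flag = 0;
		ret = syscall(SYS_futex, uaddr, FUTEX_WAIT, val,
			      NULL, NULL, 0);
	}
	return ret;
}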
Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)
Note : the same SHARED futex (mapped on a file) can be used by old binaries
*and* new binaries, because both binaries will use the old subcommands.
Note : the vast majority of futexes should use the PROCESS_PRIVATE semantic,
as this is the default semantic. Almost all applications should benefit
from these changes (new kernel and updated libc).
Some benchmark results on a Pentium M 1.6 GHz (SMP kernel on a UP machine);
a sketch of this kind of measurement loop follows the numbers.
/* calling futex_wait(addr, value) with value != *addr */
450 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
427 cycles per futex(FUTEX_WAIT) call (using one futex)
337 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
332 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
214 cycles per futex(1000) call (returns ENOSYS)
186 cycles per getppid() call
187 cycles per umask() call
182 cycles per ni_syscall() call
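For what it's worth, a cycles-per-call measurement of this kind can be taken
with a loop of roughly the following shape (a sketch only, not the exact
benchmark used above; it relies on the x86 TSC via __rdtsc()) :

/* Rough cycles-per-call measurement in the spirit of the numbers above.
 * Not the exact benchmark used for this mail. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>
#include <x86intrin.h> /* __rdtsc() */

#define LOOPS 1000000UL

int main(void)
{
	uint32_t futex_word = 0;
	uint64_t start, end;
	unsigned long i;

	start = __rdtsc();
	for (i = 0; i < LOOPS; i++) {
		/* value (1) != *addr (0): the call walks the whole fast path
		 * and returns immediately with EWOULDBLOCK, nothing blocks. */
		syscall(SYS_futex, &futex_word, FUTEX_WAIT, 1, NULL, NULL, 0);
	}
	end = __rdtsc();

	printf("%lu cycles per futex(FUTEX_WAIT) call\n",
	       (unsigned long)((end - start) / LOOPS));
	return 0;
}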
Thank you for reading this mail
[PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic
[PATCH 2/3] FUTEX : introduce private hashtables
[PATCH 3/3] FUTEX : NUMA friendly global hashtable
Signed-off-by: Eric Dumazet <dada1@xxxxxxxxxxxxx>