[PATCH 0/3] Add NUMA-awareness to qspinlock

From: Alex Kogan
Date: Wed Jan 30 2019 - 22:12:58 EST


Lock throughput can be increased by handing a lock to a waiter on the
same NUMA socket as the lock holder, provided care is taken to avoid
starvation of waiters on other NUMA sockets. This series introduces CNA
(compact NUMA-aware lock) as the slow path for qspinlock.

CNA is a NUMA-aware version of the MCS spin-lock. Spinning threads are
organized in two queues, a main queue for threads running on the same
socket as the current lock holder, and a secondary queue for threads
running on other sockets. Threads record the ID of the socket on which
they are running in their queue nodes. At unlock time, the lock
holder scans the main queue looking for a thread running on the same
socket. If such a thread T is found, all threads in the main queue
between the current lock holder and T are moved to the end of the
secondary queue, and the lock is passed to T. If no such T is found, the
lock is passed to the first node in the secondary queue. Finally, if the
secondary queue is empty, the lock is passed to the next thread in the
main queue.
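
For illustration, here is a minimal, single-threaded sketch of that
unlock-time scan. The names (cna_node, find_successor) and the plain
pointer manipulation are simplified stand-ins, not the code in the
patches, which operates on MCS queue nodes with the required atomics:

#include <stddef.h>

struct cna_node {
        int numa_id;                /* socket recorded when enqueueing */
        struct cna_node *next;      /* link in the main queue */
};

/*
 * Scan the main queue, starting after the lock holder's node "me",
 * for the first waiter on the holder's socket.  Waiters skipped over
 * are spliced onto the tail of the secondary queue (patch 3 bounds
 * how long they can stay there).  Returns the chosen successor, or
 * NULL if no waiter runs on this socket.
 */
static struct cna_node *
find_successor(struct cna_node *me, int my_numa_id,
               struct cna_node **sec_head, struct cna_node **sec_tail)
{
        struct cna_node *cur = me->next;
        struct cna_node *skipped = NULL, *skipped_tail = NULL;

        while (cur) {
                if (cur->numa_id == my_numa_id) {
                        if (skipped) {  /* move skipped prefix */
                                if (*sec_tail)
                                        (*sec_tail)->next = skipped;
                                else
                                        *sec_head = skipped;
                                skipped_tail->next = NULL;
                                *sec_tail = skipped_tail;
                        }
                        return cur;  /* main queue now starts at cur */
                }
                if (!skipped)
                        skipped = cur;
                skipped_tail = cur;
                cur = cur->next;
        }
        /* Caller falls back to the secondary queue, then to me->next. */
        return NULL;
}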

Full details are available at https://arxiv.org/abs/1810.05600.

We have done some performance evaluation with the locktorture module
as well as with several benchmarks from the will-it-scale repo.
The following locktorture results are from an Oracle X5-4 server
(four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
cores each). Each number represents an average (over 5 runs) of the
total number of ops (x10^7) reported at the end of each run. The stock
kernel is v4.20.0-rc4+ compiled in the default configuration.

#thr    stock   patched   speedup (patched/stock)
   1    2.710     2.715   1.002
   2    3.108     3.001   0.966
   4    4.194     3.919   0.934
   8    5.309     6.894   1.299
  16    6.722     9.094   1.353
  32    7.314     9.885   1.352
  36    7.562     9.855   1.303
  72    6.696    10.358   1.547
 108    6.364    10.181   1.600
 142    6.179    10.178   1.647

When the kernel is compiled with lock statistics enabled
(CONFIG_LOCK_STAT), CNA achieves even larger speedups:

#thr    stock   patched   speedup
   1    2.368     2.399   1.013
   2    2.567     2.492   0.970
   4    2.310     2.534   1.097
   8    2.893     4.468   1.544
  16    3.786     5.611   1.482
  32    4.097     5.578   1.362
  36    4.165     5.661   1.359
  72    3.484     5.841   1.677
 108    2.890     5.498   1.903
 142    2.695     5.356   1.987

This is because lockstat generates writes into shared variables inside the
critical section to update various statistics (e.g., the last CPU on which
a lock was acquired). By keeping the lock local to a socket, CNA reduces
the number of remote cache misses on accesses to the lock itself as well
as to the data accessed in the critical section.
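
As a concrete illustration of the pattern (with made-up field and
function names, not the actual lockstat code):

/*
 * Illustrative only: lockstat-style bookkeeping writes shared data
 * while the lock is held.  If consecutive holders sit on different
 * sockets, this cache line bounces across the interconnect on every
 * handoff; CNA's socket-local handoffs keep it in the local cache.
 */
struct lock_bookkeeping {
        int last_cpu;                /* last CPU to take the lock */
        unsigned long acquisitions;  /* running acquisition count */
};

static struct lock_bookkeeping bk;

static void on_acquire(int cpu)      /* called with the lock held */
{
        bk.last_cpu = cpu;           /* remote miss after a cross-socket handoff */
        bk.acquisitions++;
}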

The following tables contain throughput results (ops/us) from the same
setup for will-it-scale/open1_threads (with the kernel compiled in the
default configuration):

#thr    stock   patched   speedup
   1    0.553     0.579   1.046
   2    0.860     0.907   1.054
   4    1.503     1.533   1.020
   8    1.735     1.704   0.983
  16    1.757     1.744   0.992
  32    0.888     1.705   1.921
  36    0.878     1.746   1.988
  72    0.844     1.766   2.094
 108    0.812     1.747   2.150
 142    0.804     1.767   2.198

and will-it-scale/lock2_threads:

#thr    stock   patched   speedup
   1    1.714     1.704   0.994
   2    2.919     2.914   0.998
   4    5.024     5.157   1.027
   8    4.101     3.946   0.962
  16    4.113     3.947   0.959
  32    2.618     4.145   1.583
  36    2.561     3.981   1.554
  72    2.062     4.015   1.947
 108    2.157     3.977   1.844
 142    1.992     3.916   1.966

As part of correctness testing, we performed kernel builds on the patched
kernel with X*NCPU parallelism, for X=1,3,5.

Code reviews and performance testing are welcome and appreciated.


Alex Kogan (3):
  locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic
  locking/qspinlock: Introduce CNA into the slow path of qspinlock
  locking/qspinlock: Introduce starvation avoidance into CNA

 arch/arm/include/asm/mcs_spinlock.h   |   4 +-
 include/asm-generic/qspinlock_types.h |  10 ++
 kernel/locking/mcs_spinlock.h         |  21 +++-
 kernel/locking/qspinlock.c            | 211 ++++++++++++++++++++++++++++++----
 4 files changed, 218 insertions(+), 28 deletions(-)

--
2.11.0 (Apple Git-81)