Re: [PATCH v2 0/5] Add NUMA-awareness to qspinlock
From: Jan Glauber
Date: Wed Jul 03 2019 - 07:58:26 EST
Hi Alex,
I've tried this series on arm64 (ThunderX2 with up to SMT=4 and 224 CPUs)
with the borderline testcase of accessing a single file from all threads.
With that testcase, the qspinlock slowpath is the top spot in the kernel.
The results look really promising:
CPUs    normal    numa-qspinlocks
---------------------------------------------
 56     149.41     73.90
224     576.95    290.31
Also, frontend stalls are reduced to 50% and interconnect traffic is
greatly reduced.
Tested-by: Jan Glauber <jglauber@xxxxxxxxxxx>
--Jan
On Fri, Mar 29, 2019 at 16:23, Alex Kogan <alex.kogan@xxxxxxxxxx> wrote:
>
> This version addresses feedback from Peter and Waiman. In particular,
> the CNA functionality has been moved to a separate file, and is controlled
> by a config option (enabled by default if NUMA is enabled).
> An optimization has been introduced to reduce the overhead of shuffling
> threads between waiting queues when the lock is only lightly contended.
>
> Summary
> -------
>
> Lock throughput can be increased by handing a lock to a waiter on the
> same NUMA node as the lock holder, provided care is taken to avoid
> starvation of waiters on other NUMA nodes. This patch introduces CNA
> (compact NUMA-aware lock) as the slow path for qspinlock. It can be
> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
>
> CNA is a NUMA-aware version of the MCS spin-lock. Spinning threads are
> organized in two queues, a main queue for threads running on the same
> node as the current lock holder, and a secondary queue for threads
> running on other nodes. Threads store the ID of the node on which
> they are running in their queue nodes. At unlock time, the lock
> holder scans the main queue looking for a thread running on the same
> node. If found (call it thread T), all threads in the main queue
> between the current lock holder and T are moved to the end of the
> secondary queue, and the lock is passed to T. If no such T is found, the
> lock is passed to the first thread in the secondary queue. Finally, if the
> secondary queue is empty, the lock is passed to the next thread in the
> main queue. To avoid starvation of threads in the secondary queue,
> those threads are moved back to the head of the main queue
> after a certain expected number of intra-node lock hand-offs.
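
The hand-off policy described above can be sketched in plain C. This is a
single-threaded illustration under stated assumptions, not the code from the
patch: the names struct waiter, struct cna_queues and cna_pick_successor()
are hypothetical, the real slow path manipulates the queues with atomic
operations, and a fixed max_local_handoffs threshold stands in for the
"certain expected number" of intra-node hand-offs. The sketch also assumes
that, when the lock goes to the head of the secondary queue, the remaining
secondary waiters are placed in front of the main queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue node: one per waiting thread. */
struct waiter {
	int numa_node;		/* NUMA node the waiter runs on */
	struct waiter *next;
};

/* Hypothetical view of the two queues, as seen by the lock holder. */
struct cna_queues {
	struct waiter *main_head;	/* waiters queued behind the holder */
	struct waiter *sec_head;	/* secondary queue (other nodes) */
	struct waiter *sec_tail;
	int local_handoffs;		/* consecutive intra-node hand-offs */
};

/* Put the whole secondary queue in front of the main queue. */
static void splice_secondary_to_main(struct cna_queues *q)
{
	if (!q->sec_head)
		return;
	q->sec_tail->next = q->main_head;
	q->main_head = q->sec_head;
	q->sec_head = q->sec_tail = NULL;
}

/* Detach and return the first waiter of the main queue (or NULL). */
static struct waiter *pop_main(struct cna_queues *q)
{
	struct waiter *w = q->main_head;

	if (w) {
		q->main_head = w->next;
		w->next = NULL;
	}
	return w;
}

/* Choose the next lock holder at unlock time; NULL if nobody waits. */
struct waiter *cna_pick_successor(struct cna_queues *q, int holder_node,
				  int max_local_handoffs)
{
	struct waiter *prev = NULL, *t;

	assert(q != NULL);

	/* Starvation avoidance (assumed fixed threshold). */
	if (q->sec_head && q->local_handoffs >= max_local_handoffs) {
		splice_secondary_to_main(q);
		q->local_handoffs = 0;
		return pop_main(q);
	}

	/* Scan the main queue for a waiter on the holder's node. */
	for (t = q->main_head; t && t->numa_node != holder_node; t = t->next)
		prev = t;

	if (t) {
		if (prev) {
			/* Move the skipped waiters to the secondary tail. */
			prev->next = NULL;
			if (q->sec_tail)
				q->sec_tail->next = q->main_head;
			else
				q->sec_head = q->main_head;
			q->sec_tail = prev;
		}
		q->main_head = t->next;
		t->next = NULL;
		q->local_handoffs++;
		return t;
	}

	/*
	 * No same-node waiter: hand off to the secondary head if there is
	 * one, otherwise to the next waiter in the main queue.
	 */
	q->local_handoffs = 0;
	splice_secondary_to_main(q);
	return pop_main(q);
}
```

With waiters on nodes 1, 0, 1 queued behind a holder on node 0, the first
call picks the node-0 waiter and parks the leading node-1 waiter in the
secondary queue; the next call finds no node-0 waiter and hands the lock to
that parked waiter.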
>
> More details are available at https://arxiv.org/abs/1810.05600.
>
> We have done some performance evaluation with the locktorture module
> as well as with several benchmarks from the will-it-scale repo.
> The following locktorture results are from an Oracle X5-4 server
> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
> cores each). Each number represents an average (over 25 runs) of the
> total number of ops (x10^7) reported at the end of each run. The
> standard deviation is also reported in (), and in general, with a few
> exceptions, is about 3%. The 'stock' kernel is v5.0-rc8,
> commit 28d49e282665 ("locking/lockdep: Shrink struct lock_class_key"),
> compiled in the default configuration. 'patch' is the modified
> kernel compiled with NUMA_AWARE_SPINLOCKS not set; it is included to show
> that any performance changes to the existing qspinlock implementation are
> essentially noise. 'patch-CNA' is the modified kernel with
> NUMA_AWARE_SPINLOCKS set; the speedup is calculated by dividing
> 'patch-CNA' by 'stock'.
>
> #thr  stock          patch          patch-CNA       speedup (patch-CNA/stock)
>    1  2.731 (0.102)  2.732 (0.093)   2.716 (0.082)  0.995
>    2  3.071 (0.124)  3.084 (0.109)   3.079 (0.113)  1.003
>    4  4.221 (0.138)  4.229 (0.087)   4.408 (0.103)  1.044
>    8  5.366 (0.154)  5.274 (0.094)   6.958 (0.233)  1.297
>   16  6.673 (0.164)  6.689 (0.095)   8.547 (0.145)  1.281
>   32  7.365 (0.177)  7.353 (0.183)   9.305 (0.202)  1.263
>   36  7.473 (0.198)  7.422 (0.181)   9.441 (0.196)  1.263
>   72  6.805 (0.182)  6.699 (0.170)  10.020 (0.218)  1.472
>  108  6.509 (0.082)  6.480 (0.115)  10.027 (0.194)  1.540
>  142  6.223 (0.109)  6.294 (0.100)   9.874 (0.183)  1.587
>
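
As a quick check of how the speedup column is derived, the division can be
written out as below. The helper name speedup() is hypothetical and only
illustrates the arithmetic stated in the text.

```c
#include <assert.h>

/* Speedup as defined in the text: 'patch-CNA' ops divided by 'stock' ops. */
static double speedup(double patch_cna, double stock)
{
	assert(stock > 0.0);	/* a zero baseline would make the ratio meaningless */
	return patch_cna / stock;
}
```

For the 8-thread locktorture row, speedup(6.958, 5.366) is roughly 1.297,
which matches the value reported in the table.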
> The following tables contain throughput results (ops/us) from the same
> setup for will-it-scale/open1_threads:
>
> #thr  stock          patch          patch-CNA      speedup (patch-CNA/stock)
>    1  0.565 (0.004)  0.567 (0.001)  0.565 (0.003)  0.999
>    2  0.892 (0.021)  0.899 (0.022)  0.900 (0.018)  1.009
>    4  1.503 (0.031)  1.527 (0.038)  1.481 (0.025)  0.985
>    8  1.755 (0.105)  1.714 (0.079)  1.683 (0.106)  0.959
>   16  1.740 (0.095)  1.752 (0.087)  1.693 (0.098)  0.973
>   32  0.884 (0.080)  0.908 (0.090)  1.686 (0.092)  1.906
>   36  0.907 (0.095)  0.894 (0.088)  1.709 (0.081)  1.885
>   72  0.856 (0.041)  0.858 (0.043)  1.707 (0.082)  1.994
>  108  0.858 (0.039)  0.869 (0.037)  1.732 (0.076)  2.020
>  142  0.809 (0.044)  0.854 (0.044)  1.728 (0.083)  2.135
>
> and will-it-scale/lock2_threads:
>
> #thr  stock          patch          patch-CNA      speedup (patch-CNA/stock)
>    1  1.713 (0.004)  1.715 (0.004)  1.711 (0.004)  0.999
>    2  2.889 (0.057)  2.864 (0.078)  2.876 (0.066)  0.995
>    4  4.582 (1.032)  5.066 (0.787)  4.725 (0.959)  1.031
>    8  4.227 (0.196)  4.104 (0.274)  4.092 (0.365)  0.968
>   16  4.108 (0.141)  4.057 (0.138)  4.010 (0.168)  0.976
>   32  2.674 (0.125)  2.625 (0.171)  3.958 (0.156)  1.480
>   36  2.622 (0.107)  2.553 (0.150)  3.978 (0.116)  1.517
>   72  2.009 (0.090)  1.998 (0.092)  3.932 (0.114)  1.957
>  108  2.154 (0.069)  2.089 (0.090)  3.870 (0.081)  1.797
>  142  1.953 (0.106)  1.943 (0.111)  3.853 (0.100)  1.973
>
> Further comments are welcome and appreciated.
>
> Alex Kogan (5):
>   locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic
>   locking/qspinlock: Refactor the qspinlock slow path
>   locking/qspinlock: Introduce CNA into the slow path of qspinlock
>   locking/qspinlock: Introduce starvation avoidance into CNA
>   locking/qspinlock: Introduce the shuffle reduction optimization into
>     CNA
>
>  arch/arm/include/asm/mcs_spinlock.h   |   4 +-
>  arch/x86/Kconfig                      |  14 ++
>  include/asm-generic/qspinlock_types.h |  13 ++
>  kernel/locking/mcs_spinlock.h         |  16 ++-
>  kernel/locking/qspinlock.c            |  77 +++++++++--
>  kernel/locking/qspinlock_cna.h        | 245 ++++++++++++++++++++++++++++++++++
>  6 files changed, 354 insertions(+), 15 deletions(-)
>  create mode 100644 kernel/locking/qspinlock_cna.h
>
> --
> 2.11.0 (Apple Git-81)
>