[RFC PATCH v3 0/7]

From: Fedorov Nikita

Date: Wed Apr 15 2026 - 13:10:16 EST


Changes since v2:
- split the patch into a smaller patch series,
though the hqlock internal logic itself is still hard to split further
- remove some unused code
- rebased onto Linux v7.0
- allocate hq-spinlock metadata with kvmalloc instead of memblock
- added contention detection and verified that there is no performance degradation in low-contention scenarios

[Motivation]

Under high contention, existing Linux kernel spinlock implementations can become
inefficient on modern NUMA systems due to frequent and expensive
cross-NUMA cacheline transfers.

This happens for the following reasons:
- on "contender enqueue", each lock contender updates a shared lock structure;
- on "MCS handoff", a cross-NUMA cacheline transfer occurs when
two consecutive contenders are from different NUMA nodes.

Previous work regarding NUMA-aware spinlock in Linux kernel is CNA-lock:
https://lore.kernel.org/lkml/20210514200743.3026725-1-alex.kogan@xxxxxxxxxx/

It reduces cross-NUMA cacheline traffic during handoff, but not during enqueuing.
The CNA design also requires the first contender to do additional work during global spinning
and keeps threads from all nodes other than the first one in a single secondary queue.
In our measurements, we only saw benefits from using it on Kunpeng;
on x86 platforms, CNA behaved the same as a regular qspinlock.
Thus, there is still quite a lot of potential for optimization.

HQ-lock follows a completely different design concept: it is a kind of
cohort-lock and queued-spinlock hybrid.

If someone wants to try HQ-lock in some subsystem, it is enough to
change the lock initialization code from `spin_lock_init()` to `spin_lock_init_hq()`,
or to change the `DEFINE_SPINLOCK()` macro to `DEFINE_SPINLOCK_HQ()` if the lock is static.
A dedicated bit in the lock structure is used to distinguish between the two lock types.
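For illustration, opting a subsystem into HQ-lock would look roughly like this (a non-compilable sketch against the API described above; `struct foo` and `foo_init()` are made-up names):

```c
/* Static lock: */
static DEFINE_SPINLOCK_HQ(my_lock);

/* Dynamically initialized lock embedded in an object: */
struct foo {
	spinlock_t lock;
};

static void foo_init(struct foo *f)
{
	spin_lock_init_hq(&f->lock);	/* instead of spin_lock_init() */
}
```

All lock/unlock call sites (`spin_lock()`, `spin_unlock()`, etc.) stay unchanged, since the dedicated type bit lets the slow path dispatch to the HQ-lock logic at runtime.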

[Performance measurements]

Performance measurements were done on x86 (AMD EPYC) and arm64 (Kunpeng 920)
platforms with the following scenarios:
- Locktorture benchmark
- Memcached + memtier benchmark
- Nginx + wrk benchmark

[Locktorture]

NPS stands for "NUMA nodes per socket".
+------------------------------+-----------------------+-------+-------+--------+
| AMD EPYC 9654                |                       |       |       |        |
+------------------------------+-----------------------+-------+-------+--------+
| 192 cores (x2 hyper-threads) |                       |       |       |        |
| 2 sockets                    |                       |       |       |        |
| Locktorture 60 sec.          | NUMA nodes per-socket |       |       |        |
| Average gain (single lock)   | 1 NPS                 | 2 NPS | 4 NPS | 12 NPS |
| Total contender threads      |                       |       |       |        |
| 8                            | 19%                   | 21%   | 12%   | 12%    |
| 16                           | 13%                   | 18%   | 34%   | 75%    |
| 32                           | 8%                    | 14%   | 25%   | 112%   |
| 64                           | 11%                   | 12%   | 30%   | 152%   |
| 128                          | 9%                    | 17%   | 37%   | 163%   |
| 256                          | 2%                    | 16%   | 40%   | 168%   |
| 384                          | -1%                   | 14%   | 44%   | 186%   |
+------------------------------+-----------------------+-------+-------+--------+

+-----------------+-------+-------+-------+--------+
| Fairness factor | 1 NPS | 2 NPS | 4 NPS | 12 NPS |
+-----------------+-------+-------+-------+--------+
| 8               | 0.54  | 0.57  | 0.57  | 0.55   |
| 16              | 0.52  | 0.53  | 0.60  | 0.58   |
| 32              | 0.53  | 0.53  | 0.53  | 0.61   |
| 64              | 0.52  | 0.56  | 0.54  | 0.56   |
| 128             | 0.51  | 0.54  | 0.54  | 0.53   |
| 256             | 0.52  | 0.52  | 0.52  | 0.52   |
| 384             | 0.51  | 0.51  | 0.51  | 0.51   |
+-----------------+-------+-------+-------+--------+

+-------------------------+--------------+
| Kunpeng 920 (arm64)     |              |
+-------------------------+--------------+
| 96 cores (no MT)        |              |
| 2 sockets, 4 NUMA nodes |              |
| Locktorture 60 sec.     |              |
|                         |              |
| Total contender threads | Average gain |
| 8                       | 93%          |
| 16                      | 142%         |
| 32                      | 129%         |
| 64                      | 152%         |
| 96                      | 158%         |
+-------------------------+--------------+

[Memcached]

+---------------------------------+-----------------+-------------------+
| AMD EPYC 9654                   |                 |                   |
+---------------------------------+-----------------+-------------------+
| 192 cores (x2 hyper-threads)    |                 |                   |
| 2 sockets, NPS=4                |                 |                   |
|                                 |                 |                   |
| Memtier+memcached 1:1 R/W ratio |                 |                   |
| Workers                         | Throughput gain | Latency change    |
| 32                              | 1%              | -1%               |
| 64                              | 1%              | -1%               |
| 128                             | 3%              | -4%               |
| 256                             | 7%              | -6%               |
| 384                             | 10%             | -8%               |
+---------------------------------+-----------------+-------------------+

+---------------------------------+-----------------+-------------------+
| Kunpeng 920 (arm64)             |                 |                   |
+---------------------------------+-----------------+-------------------+
| 96 cores (no MT)                |                 |                   |
| 2 sockets, 4 NUMA nodes         |                 |                   |
|                                 |                 |                   |
| Memtier+memcached 1:1 R/W ratio |                 |                   |
| Workers                         | Throughput gain | Latency change    |
| 32                              | 4%              | -3%               |
| 64                              | 6%              | -6%               |
| 80                              | 8%              | -7%               |
| 96                              | 8%              | -8%               |
+---------------------------------+-----------------+-------------------+

[Nginx]

+-----------------------------------------------------------------------+-----------------+
| Kunpeng 920 (arm64)                                                   |                 |
+-----------------------------------------------------------------------+-----------------+
| 96 cores (no MT)                                                      |                 |
| 2 sockets, 4 NUMA nodes                                               |                 |
|                                                                       |                 |
| Nginx + wrk benchmark, single file (lockref spinlock contention case) |                 |
| Workers                                                               | Throughput gain |
| 32                                                                    | 1%              |
| 64                                                                    | 68%             |
| 80                                                                    | 72%             |
| 96                                                                    | 78%             |
+-----------------------------------------------------------------------+-----------------+
Although this is a single-file test, it corresponds to real-life cases where some
HTML pages are accessed much more frequently than others (index.html, etc.).

[Low contention remarks]
After adding the contention detection scheme, we see no performance degradation in
low-contention scenarios (< 8 threads): HQ-spinlock throughput is equal to qspinlock's,
while the improvement in the high-contention cases above remains practically the same.

Previous version:
https://lore.kernel.org/lkml/20251206062106.2109014-1-stepanov.anatoly@xxxxxxxxxx/

Anatoly Stepanov (7):
kernel: add hq-spinlock types
hq-spinlock: implement inner logic
hq-spinlock: add contention detection
hq-spinlock: add hq-spinlock tunables and debug statistics
kernel: introduce general hq-spinlock support
lockref: use hq-spinlock
futex: use hq-spinlock for hash buckets

arch/arm64/include/asm/qspinlock.h | 37 +
arch/x86/include/asm/hq-spinlock.h | 34 +
arch/x86/include/asm/paravirt-spinlock.h | 3 +-
arch/x86/include/asm/qspinlock.h | 6 +-
include/asm-generic/qspinlock.h | 23 +-
include/asm-generic/qspinlock_types.h | 44 +-
include/linux/lockref.h | 2 +-
include/linux/spinlock.h | 26 +
include/linux/spinlock_types.h | 26 +
include/linux/spinlock_types_raw.h | 20 +
kernel/Kconfig.locks | 29 +
kernel/futex/core.c | 2 +-
kernel/locking/hqlock_core.h | 850 +++++++++++++++++++++++
kernel/locking/hqlock_meta.h | 487 +++++++++++++
kernel/locking/hqlock_proc.h | 164 +++++
kernel/locking/hqlock_types.h | 122 ++++
kernel/locking/qspinlock.c | 65 +-
kernel/locking/qspinlock.h | 4 +-
kernel/locking/spinlock_debug.c | 20 +
mm/mempolicy.c | 4 +
20 files changed, 1939 insertions(+), 29 deletions(-)
create mode 100644 arch/arm64/include/asm/qspinlock.h
create mode 100644 arch/x86/include/asm/hq-spinlock.h
create mode 100644 kernel/locking/hqlock_core.h
create mode 100644 kernel/locking/hqlock_meta.h
create mode 100644 kernel/locking/hqlock_proc.h
create mode 100644 kernel/locking/hqlock_types.h

--
2.34.1