Re: A new spinlock for multicore (>16) platform

From: Shi-Wu, Lo(Gmail)
Date: Mon Jul 15 2024 - 06:01:46 EST


Here are my explanations for the two issues you mentioned:

1) Each lock requires a routing table of size num_possible_cpus().
There can be thousands of spinlocks in the kernel, and managing this
extra memory may be problematic.

We have implemented this method in the Linux kernel
(https://github.com/shiwulo/ron-osdi2023/blob/main/qspinlock.patch).
All locks share a single routing table, and all locks share a single
waiting array. A kernel with this patch has been running for over a
year without any issues. So far we have tested it on the AMD
Threadripper 2990WX (32 cores) and the AMD Threadripper PRO 3995WX
(64 cores).

This works because a context switch cannot occur while the Linux
kernel holds a spinlock (preemption is disabled), so all cores can
safely share a single waiting array. The space complexity of RON in
the Linux kernel is O(#core), the same as that of qspinlock. And
since the kernel runs on every core of the processor, all spinlocks
can share one routing table, roughly as sketched below.
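
To make the sharing argument concrete, here is a minimal userspace C
sketch. The names (ron_waiting, ron_route, ron_announce) are mine for
illustration; they are not the identifiers used in qspinlock.patch.

#include <stdatomic.h>

#define NCPUS 64    /* stand-in for num_possible_cpus() */

/*
 * One waiting slot per core, shared by EVERY lock in the system.
 * This is safe because preemption is disabled while a kernel
 * spinlock is taken, so a core can be spinning on at most one lock
 * at any moment. Total space stays O(#core) no matter how many
 * locks exist. (In a real kernel each slot would also be padded to
 * a cache line to avoid false sharing.)
 */
static _Atomic int ron_waiting[NCPUS]; /* core n is waiting for a lock */
static unsigned char ron_route[NCPUS]; /* single system-wide routing table */

static void ron_announce(int cpu)
{
    /* Mark this core as a waiter; the holder's unlock path consults
       the shared state to decide whom to hand the lock to. */
    atomic_store_explicit(&ron_waiting[cpu], 1, memory_order_release);
}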

2) Besides performance under lock contention, the Linux kernel also
has to provide good uncontended spinlock performance. The unlock code
currently scans the routing table to find the next CPU to hand the
lock to, which may slow down uncontended spinlock performance.

We use a bit array to speed up the unlock path. The nth bit indicates
that the nth core wants to enter the critical section, so the next
core to enter can be determined quickly with a single bit-scan such
as __ffs. (See page 29 of
https://www.cs.ccu.edu.tw/~shiwulo/nxt+sh-RON.pdf.) The unlock
operation of the new method (nxtRON) therefore has O(1) time
complexity; a small sketch follows below. Under low contention,
nxtRON performs almost identically to GNU's pthread_spin_lock (TTAS).
Similarly, under low contention, qspinlock's performance is nearly
the same as pthread_spin_lock's.
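
As a rough userspace illustration of that bit-scan handover (the
function name and exact shape are mine, not the patch's;
__builtin_ctzll plays the role of the kernel's __ffs):

#include <stdint.h>

/* Bit n set => core n wants to enter the critical section. */
static inline int ron_next_core(uint64_t waiting, int self)
{
    /* Start the scan just past the releasing core so cores are
       served in a fixed circular order (bounded waiting), while the
       lookup itself is one bit-scan instruction: O(1). */
    uint64_t after = waiting & ~((2ULL << self) - 1); /* bits above self */

    if (after)
        return __builtin_ctzll(after);
    if (waiting)                  /* wrap around to the lowest waiter */
        return __builtin_ctzll(waiting);
    return -1;                    /* no waiter: uncontended unlock */
}

Because the successor is found with one instruction instead of a
table walk, the uncontended unlock path stays short, which is what
keeps nxtRON close to TTAS under low contention.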

As shown in the PDF (slides 56 to 59,
https://www.cs.ccu.edu.tw/~shiwulo/nxt+sh-RON.pdf), there can be a
3.8% improvement for the LevelDB application.



Furthermore, the issues on multi-core (#core > 16) processors and on
NUMA systems are different. NUMA-aware spinlocks assume that data
transfer between processors is slow while data transfer within the
same processor is fast. Based on this assumption, NUMA-aware
spinlocks let tasks on the same processor enter the critical section
before tasks on other processors, roughly as caricatured below.
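
A deliberately simplified sketch of that handover preference (my own
caricature, not any specific implementation; real designs such as CNA
add thresholds to limit the bypassing):

#include <stdint.h>

/* Pick the next lock holder, preferring waiters on our NUMA node. */
static inline int numa_aware_next(uint64_t waiting, uint64_t local_mask)
{
    uint64_t local = waiting & local_mask; /* waiters on this node */

    /* Same-node handover keeps the lock's cache line on one socket,
       but a remote waiter can be bypassed again and again: there is
       no fixed bound on its waiting time. */
    if (local)
        return __builtin_ctzll(local);
    if (waiting)
        return __builtin_ctzll(waiting);
    return -1;
}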

This approach (NUMA-aware spinlocks) has two main drawbacks. The
first is fairness: as the sketch shows, tasks on remote processors
can be bypassed repeatedly. The second is that NUMA-aware spinlocks
do not optimize data transfer within the same processor. The RON
algorithm adheres to bounded waiting and can optimize for multi-core
processors.