Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

From: Waiman Long
Date: Wed Jun 12 2019 - 11:10:06 EST

On 6/12/19 12:38 AM, Alex Kogan wrote:
> Hi, Wei.
>> On Jun 11, 2019, at 12:22 AM, liwei (GF) <liwei391@xxxxxxxxxx> wrote:
>> Hi Alex,
>> On 2019/3/29 23:20, Alex Kogan wrote:
>>> In CNA, spinning threads are organized in two queues, a main queue for
>>> threads running on the same node as the current lock holder, and a
>>> secondary queue for threads running on other nodes. At the unlock time,
>>> the lock holder scans the main queue looking for a thread running on
>>> the same node. If found (call it thread T), all threads in the main queue
>>> between the current lock holder and T are moved to the end of the
>>> secondary queue, and the lock is passed to T. If such T is not found, the
>>> lock is passed to the first node in the secondary queue. Finally, if the
>>> secondary queue is empty, the lock is passed to the next thread in the
>>> main queue. For more details, see
>>> Note that this variant of CNA may introduce starvation by continuously
>>> passing the lock to threads running on the same node. This issue
>>> will be addressed later in the series.
>>> Enabling CNA is controlled via a new configuration option
>>> (NUMA_AWARE_SPINLOCKS), which is enabled by default if NUMA is enabled.
>>> Signed-off-by: Alex Kogan <alex.kogan@xxxxxxxxxx>
>>> Reviewed-by: Steve Sistare <steven.sistare@xxxxxxxxxx>
>>> ---
>>> arch/x86/Kconfig | 14 +++
>>> include/asm-generic/qspinlock_types.h | 13 +++
>>> kernel/locking/mcs_spinlock.h | 10 ++
>>> kernel/locking/qspinlock.c | 29 +++++-
>>> kernel/locking/qspinlock_cna.h | 173 ++++++++++++++++++++++++++++++++++
>>> 5 files changed, 236 insertions(+), 3 deletions(-)
>>> create mode 100644 kernel/locking/qspinlock_cna.h
>> (SNIP)
>>> +
>>> +static __always_inline int get_node_index(struct mcs_spinlock *node)
>>> +{
>>> + return decode_count(node->node_and_count++);
>> When nesting level is > 4, it won't return a index >= 4 here and the numa node number
>> is changed by mistake. It will go into a wrong way instead of the following branch.
>> /*
>> * 4 nodes are allocated based on the assumption that there will
>> * not be nested NMIs taking spinlocks. That may not be true in
>> * some architectures even though the chance of needing more than
>> * 4 nodes will still be extremely unlikely. When that happens,
>> * we fall back to spinning on the lock directly without using
>> * any MCS node. This is not the most elegant solution, but is
>> * simple enough.
>> */
>> if (unlikely(idx >= MAX_NODES)) {
>> while (!queued_spin_trylock(lock))
>> cpu_relax();
>> goto release;
>> }
> Good point.
> This patch does not handle count overflows gracefully.
> It can be easily fixed by allocating more bits for the count â we donât really need 30 bits for #NUMA nodes.

Actually, the default setting uses 2 bits for 4-level nesting and 14
bits for cpu numbers. That means it can support up to 16k-1 cpus. It is
a limit that is likely to be exceeded in the foreseeable future.
qspinlock also supports an additional mode with 21 bits used for cpu
numbers. That can support up to 2M-1 cpus. However, this mode will be a
little bit slower. That is why we don't want to use more than 2 bits for
nesting as I have never see more than 2 level of nesting used in my
testing. So it is highly unlikely we will ever hit more than 4 levels. I
am not saying that it is impossible, though.