Re: [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

From: Alex Kogan
Date: Mon Aug 31 2020 - 17:40:37 EST



> On Jul 28, 2020, at 4:00 PM, Waiman Long <longman@xxxxxxxxxx> wrote:
>
> On 4/3/20 4:59 PM, Alex Kogan wrote:
>> In CNA, spinning threads are organized in two queues, a primary queue for
>> threads running on the same node as the current lock holder, and a
>> secondary queue for threads running on other nodes. After acquiring the
>> MCS lock and before acquiring the spinlock, the lock holder scans the
>> primary queue looking for a thread running on the same node (pre-scan). If
>> found (call it thread T), all threads in the primary queue between the
>> current lock holder and T are moved to the end of the secondary queue.
>> If such T is not found, we make another scan of the primary queue when
>> unlocking the MCS lock (post-scan), starting at the position where
>> pre-scan stopped. If both scans fail to find such T, the MCS lock is
>> passed to the first thread in the secondary queue. If the secondary queue
>> is empty, the lock is passed to the next thread in the primary queue.
>> For more details, see https://urldefense.com/v3/__https://arxiv.org/abs/1810.05600__;!!GqivPVa7Brio!OaieLQ3MMZThgxr-Of8i9dbN5CwG8mXSIBJ_sUofhAXcs43IWL2x-stO-XKLEebn$ .
>>
>> Note that this variant of CNA may introduce starvation by continuously
>> passing the lock to threads running on the same node. This issue
>> will be addressed later in the series.
>>
>> Enabling CNA is controlled via a new configuration option
>> (NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at the
>> boot time only if we run on a multi-node machine in native environment and
>> the new config is enabled. (For the time being, the patching requires
>> CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
>> resolved once static_call() is available.) This default behavior can be
>> overridden with the new kernel boot command-line option
>> "numa_spinlock=on/off" (default is "auto").
>>
>> Signed-off-by: Alex Kogan <alex.kogan@xxxxxxxxxx>
>> Reviewed-by: Steve Sistare <steven.sistare@xxxxxxxxxx>
>> Reviewed-by: Waiman Long <longman@xxxxxxxxxx>
>> ---
>
> There is also a concern that the worst case latency for a lock transfer can be close to O(n) which can be quite large for large SMP systems. I have a patch on top that modifies the current behavior to limit the number of node scans after the lock is freed.

I understand the concern. While your patch addresses it, I am afraid it makes
the code somewhat more complex, and duplicates some of the slow path
functionality (e.g., the spin loop until the lock value changes to a certain
value).

Let me suggest a different idea for gradually restructuring the main queue
that has some similarity to the way you suggested to handle prioritized waiters.
Basically, instead of scanning the entire chain of main queue waiters,
we can check only the next waiter and, if present and it runs on a different
node, move it to the secondary queue. In addition, to maintain the preference
for a certain numa node ID, we set the numa node of the next-next waiter,
if present, to that of the current lock holder. This is the part similar to the
way you suggested to handle prioritized waiters.

This way, the worst case complexity of cna_order_queue() decreases from O(n)
down to O(1), as we always “scan" only one waiter. And as before, we change
the preference (and flush the secondary queue) after X handovers (or after
Y ms, as in your in other patch).

I attach the patch that applies on top of your patch for prioritized nodes
(0006), but does not include your patch 0007 (time based threshold),
which I will integrate into the series in the next revision.

Please, let me know what you think.

Thanks,
— Alex

Attachment: 0007-locking-qspinlock-Implement-incremental-culling-in-C.patch
Description: Binary data


>
> Cheers,
> Longman
>
>
> <0008-locking-qspinlock-Limit-CNA-node-scan-after-the-lock.patch>