Re: [PATCH v4 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

From: Alex Kogan
Date: Thu Sep 19 2019 - 11:56:44 EST


>> +/*
>> + * cna_try_find_next - scan the main waiting queue looking for the first
>> + * thread running on the same NUMA node as the lock holder. If found (call it
>> + * thread T), move all threads in the main queue between the lock holder and
>> + * T to the end of the secondary queue and return T; otherwise, return NULL.
>> + *
>> + * Schematically, this may look like the following (nn stands for numa_node and
>> + * et stands for encoded_tail).
>> + *
>> + * when cna_try_find_next() is called (the secondary queue is empty):
>> + *
>> + * A+------------+ B+--------+ C+--------+ T+--------+
>> + * |mcs:next | -> |mcs:next| -> |mcs:next| -> |mcs:next| -> NULL
>> + * |mcs:locked=1| |cna:nn=0| |cna:nn=2| |cna:nn=1|
>> + * |cna:nn=1 | +--------+ +--------+ +--------+
>> + * +----------- +
>> + *
>> + * when cna_try_find_next() returns (the secondary queue contains B and C):
>> + *
>> + * A+----------------+ T+--------+
>> + * |mcs:next | -> |mcs:next| -> NULL
>> + * |mcs:locked=B.et | -+ |cna:nn=1|
>> + * |cna:nn=1 | | +--------+
>> + * +--------------- + |
>> + * |
>> + * +-> B+--------+ C+--------+
>> + * |mcs:next| -> |mcs:next|
>> + * |cna:nn=0| |cna:nn=2|
>> + * |cna:tail| -> +--------+
>> + * +--------+
>> + *
>> + * The worst case complexity of the scan is O(n), where n is the number
>> + * of current waiters. However, the fast path, which is expected to be the
>> + * common case, is O(1).
>> + */
>> +static struct mcs_spinlock *cna_try_find_next(struct mcs_spinlock *node,
>> + struct mcs_spinlock *next)
>> +{
>> + struct cna_node *cn = (struct cna_node *)node;
>> + struct cna_node *cni = (struct cna_node *)next;
>> + struct cna_node *first, *last = NULL;
>> + int my_numa_node = cn->numa_node;
>> +
>> + /* fast path: immediate successor is on the same NUMA node */
>> + if (cni->numa_node == my_numa_node)
>> + return next;
>> +
>> + /* find any next waiter on 'our' NUMA node */
>> + for (first = cni;
>> + cni && cni->numa_node != my_numa_node;
>> + last = cni, cni = (struct cna_node *)READ_ONCE(cni->mcs.next))
>> + ;
>> +
>> + /* if found, splice any skipped waiters onto the secondary queue */
>> + if (cni && last)
>> + cna_splice_tail(cn, first, last);
>> +
>> + return (struct mcs_spinlock *)cni;
>> +}
>
> At the Linux Plumbers Conference last week, Will has raised the concern
> about the latency of the O(1) cna_try_find_next() operation that will
> add to the lock hold time.
While the worst case complexity of the scan is O(n), I _think it can be proven
that the amortized complexity is O(1). For intuition, consider a two-node
system with N threads total. In the worst case scenario, the scan will go
over N/2 threads running on a different node. If the scan ultimately âfailsâ
(no thread from the lock holderâs node is found), the lock will be passed
to the first thread from a different node and then between all those N/2 threads,
with a scan of just one node for the next N/2 - 1 passes. Otherwise, those
N/2 threads will be moved to the secondary queue. On the next lock handover,
we pass the lock either to the next thread in the main queue (as it has to be
from our node) or to the first node in the secondary queue. In both cases, we
scan just one node, and in the latter case, we have again N/2 - 1 passes with
a scan of just one node each.

> One way to hide some of the latency is to do
> a pre-scan before acquiring the lock. The CNA code could override the
> pv_wait_head_or_lock() function to call cna_try_find_next() as a
> pre-scan and return 0. What do you think?
This is certainly possible, but I do not think it would completely eliminate
the worst case scenario. It will probably make it even less likely, but at
the same time, we will reduce the chance of actually finding a thread from the
same node (that may enter the main queue while we wait for the owner & pending
to go away).

Regards,
â Alex