Re: [PATCH v2 1/4] locking/qspinlock: Handle > 4 slowpath nesting levels

From: Waiman Long
Date: Wed Jan 23 2019 - 15:11:27 EST


On 01/23/2019 04:34 AM, Will Deacon wrote:
> On Tue, Jan 22, 2019 at 10:49:08PM -0500, Waiman Long wrote:
>> Four queue nodes per cpu are allocated to enable up to 4 nesting levels
>> using the per-cpu nodes. Nested NMIs are possible in some architectures.
>> Still it is very unlikely that we will ever hit more than 4 nested
>> levels with contention in the slowpath.
>>
>> When that rare condition happens, however, it is likely that the system
>> will hang or crash shortly after that. It is not good and we need to
>> handle this exception case.
>>
>> This is done by spinning directly on the lock using repeated trylock.
>> This alternative code path should only be used when there is nested
>> NMIs. Assuming that the locks used by those NMI handlers will not be
>> heavily contended, a simple TAS locking should work out.
>>
>> Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
>> ---
>> kernel/locking/qspinlock.c | 15 +++++++++++++++
>> 1 file changed, 15 insertions(+)
>>
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index 8a8c3c2..0875053 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -412,6 +412,21 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>> idx = node->count++;
>> tail = encode_tail(smp_processor_id(), idx);
> Does the compiler generate better code if we move the tail assignment
> further down, closer to the xchg_tail() call?
>
>> + /*
>> + * 4 nodes are allocated based on the assumption that there will
>> + * not be nested NMIs taking spinlocks. That may not be true in
>> + * some architectures even though the chance of needing more than
>> + * 4 nodes will still be extremely unlikely. When that happens,
>> + * we fall back to spinning on the lock directly without using
>> + * any MCS node. This is not the most elegant solution, but is
>> + * simple enough.
>> + */
>> + if (unlikely(idx >= MAX_NODES)) {
>> + while (!queued_spin_trylock(lock))
>> + cpu_relax();
>> + goto release;
>> + }
> Acked-by: Will Deacon <will.deacon@xxxxxxx>
>
> Will

Looking at the generated x86 code:

424             if (unlikely(idx >= MAX_NODES)) {
   0x00000000000003ce <+206>:   test   %ecx,%ecx
   0x00000000000003d0 <+208>:   jg     0x4c6 <native_queued_spin_lock_slowpath+454>

425                     qstat_inc(qstat_lock_no_node, true);
426                     while (!queued_spin_trylock(lock))

   0x00000000000004c2 <+450>:   jne    0x482 <native_queued_spin_lock_slowpath+386>
   0x00000000000004c4 <+452>:   jmp    0x491 <native_queued_spin_lock_slowpath+401>
   0x00000000000004c6 <+454>:   incq   %gs:0x0(%rip)        # 0x4ce <native_queued_spin_lock_slowpath+462>
   0x00000000000004ce <+462>:   mov    $0x1,%edx
   0x00000000000004d3 <+467>:   jmp    0x4d7 <native_queued_spin_lock_slowpath+471>
   0x00000000000004d5 <+469>:   pause
   0x00000000000004d7 <+471>:   mov    (%rdi),%eax
   0x00000000000004d9 <+473>:   test   %eax,%eax
   0x00000000000004db <+475>:   jne    0x4d5 <native_queued_spin_lock_slowpath+469>
   0x00000000000004dd <+477>:   lock cmpxchg %edx,(%rdi)
   0x00000000000004e1 <+481>:   jne    0x4d5 <native_queued_spin_lock_slowpath+469>
   0x00000000000004e3 <+483>:   jmp    0x491 <native_queued_spin_lock_slowpath+4

MAX_NODES was modified to 1 in the test kernel, which is why the check
compiles to a test of idx against zero (idx >= 1 is the same as idx > 0).

So the additional code checks the idx value and branches to the end of the
function when the condition is true. There isn't much overhead here.
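
For anyone who wants to play with the shape of that out-of-line loop outside
the kernel, here is a rough userspace sketch in C11 atomics. It is only an
illustration, not the kernel code: struct sketch_lock, SKETCH_LOCKED and all
the sketch_*() helpers are made-up names, the relax helper is just a compiler
barrier standing in for cpu_relax(), and the lock word is assumed to be 0
when unlocked. Like the disassembly above, it reads the lock word first and
only attempts the cmpxchg when the word reads as zero.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct qspinlock: 0 means unlocked. */
struct sketch_lock {
	_Atomic uint32_t val;
};

#define SKETCH_LOCKED 1U

/* Stand-in for cpu_relax(); this is only a compiler barrier, not "pause". */
static inline void sketch_cpu_relax(void)
{
	atomic_signal_fence(memory_order_seq_cst);
}

/* Plain load first; attempt the cmpxchg only if the word reads as 0. */
static inline bool sketch_trylock(struct sketch_lock *lock)
{
	uint32_t expected = 0;

	if (atomic_load_explicit(&lock->val, memory_order_relaxed) != 0)
		return false;

	return atomic_compare_exchange_strong_explicit(&lock->val, &expected,
						       SKETCH_LOCKED,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Fallback when no queue node is left: spin directly on the lock word. */
static inline void sketch_lock_no_node(struct sketch_lock *lock)
{
	while (!sketch_trylock(lock))
		sketch_cpu_relax();
}

static inline void sketch_unlock(struct sketch_lock *lock)
{
	atomic_store_explicit(&lock->val, 0, memory_order_release);
}
```

A waiter on this path does not queue, so it gets no fairness guarantee. That
is the trade-off the commit message accepts on the assumption that locks
taken from nested NMI handlers are not heavily contended.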

Cheers,
Longman