Re: [PATCH] locking/qspinlock: Optimize pending state waiting for unlock
From: Guo Ren
Date: Sun Dec 25 2022 - 07:00:56 EST
On Sun, Dec 25, 2022 at 11:31 AM Waiman Long <longman@xxxxxxxxxx> wrote:
>
> On 12/24/22 22:29, Waiman Long wrote:
> > On 12/24/22 21:57, Guo Ren wrote:
> >> On Sun, Dec 25, 2022 at 9:55 AM Waiman Long <longman@xxxxxxxxxx> wrote:
> >>> On 12/24/22 07:05, guoren@xxxxxxxxxx wrote:
> >>>> From: Guo Ren <guoren@xxxxxxxxxxxxxxxxx>
> >>>>
> >>>> When we're pending, we only care about the lock value. The xchg_tail
> >>>> wouldn't affect the pending state. That means the hardware thread
> >>>> could stay in a sleep state and leave the rest of the pipeline's
> >>>> execution-unit resources to other hardware threads. This optimization
> >>>> may work only for SMT scenarios because the granularity between
> >>>> cores is the cache block.
> >> Please have a look at the comment I've written.
> >>
> >>>> Signed-off-by: Guo Ren <guoren@xxxxxxxxxxxxxxxxx>
> >>>> Signed-off-by: Guo Ren <guoren@xxxxxxxxxx>
> >>>> Cc: Waiman Long <longman@xxxxxxxxxx>
> >>>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >>>> Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
> >>>> Cc: Will Deacon <will@xxxxxxxxxx>
> >>>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> >>>> ---
> >>>> kernel/locking/qspinlock.c | 4 ++--
> >>>> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> >>>> index 2b23378775fe..ebe6b8ec7cb3 100644
> >>>> --- a/kernel/locking/qspinlock.c
> >>>> +++ b/kernel/locking/qspinlock.c
> >>>> @@ -371,7 +371,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >>>> /*
> >>>> * We're pending, wait for the owner to go away.
> >>>> *
> >>>> - * 0,1,1 -> 0,1,0
> >>>> + * 0,1,1 -> *,1,0
> >>>> *
> >>>> * this wait loop must be a load-acquire such that we match the
> >>>> * store-release that clears the locked bit and create lock
> >>> Yes, we don't care about the tail.
> >>>> @@ -380,7 +380,7 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >>>> * barriers.
> >>>> */
> >>>> if (val & _Q_LOCKED_MASK)
> >>>> - atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
> >>>> + smp_cond_load_acquire(&lock->locked, !VAL);
> >>>>
> >>>> /*
> >>>> * take ownership and clear the pending bit.
> >>> We may save an AND operation here which may be a cycle or two. I
> >>> remember that it may be more costly to load a byte instead of an
> >>> integer in some arches. So it doesn't seem like that much of an
> >>> optimization from my point of view.
> >> The reason is, of course, not here. See my commit comment.
> >>
> >>> I know that arm64 will enter a low power state in
> >>> this *cond_load_acquire() loop, but I believe any change in the
> >>> state of the lock cacheline will wake it up. So it doesn't really
> >>> matter if you are checking a byte or an int.
> >> The situation is an SMT scenario within the same core, not one of
> >> entering a low-power state. Of course, the granularity between cores is
> >> "cacheline", but the granularity between SMT hw threads of the same
> >> core could be "byte", which the internal LSU handles. For example, when a
> >> hw-thread yields the resources of the core to other hw-threads, this
> >> patch could help the hw-thread stay in the sleep state and prevent it
> >> from being woken up by another hw-thread's xchg_tail.
> >>
> >> Finally, from the software semantic view, does the patch make it more
> >> accurate? (We don't care about the tail here.)
> >
> > Thanks for the clarification.
> >
> > I am not arguing for the simplification part. I just want to clarify
> > my limited understanding of how the CPU hardware is actually dealing
> > with these conditions.
> >
> > With that, I am fine with this patch. It would be nice if you can
> > elaborate a bit more in your commit log.
> >
> > Acked-by: Waiman Long <longman@xxxxxxxxxx>
> >
> BTW, have you actually observed any performance improvement with this patch?
Not yet. I'm researching how hardware could serve qspinlock better.
So far I have reached three conclusions:

 1. Atomic forward-progress guarantee: prevents unnecessary LL/SC
    retries, which may cause expensive bus transactions when crossing
    NUMA nodes.
 2. Sub-word atomic primitive: keeps locked, pending, and tail from
    interfering with one another.
 3. Load-cond primitive: prevents the processor from wasting cycles
    spinning in a loop to detect the change.

For points 2 & 3, I have a follow-up proposal to add new
atomic_read_cond_mask & smp_load_cond_mask to the Linux atomic
primitives [1].

[1]: https://lore.kernel.org/lkml/20221225115529.490378-1-guoren@xxxxxxxxxx/
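
To illustrate points 2 and 3, here is a minimal user-space sketch using
C11 atomics. It is not kernel code and not the API from [1]. The bit
layout is an assumption that mirrors qspinlock when NR_CPUS < 16K, and
demo_xchg_tail() and demo_load_cond_mask_acquire() are hypothetical
names made up for this example. It only shows that a tail update leaves
the locked byte alone, and that a masked wait depends on the locked
byte only (0,1,1 -> *,1,0).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Assumed layout (mirrors qspinlock when NR_CPUS < 16K):
 *   bits  0-7  : locked byte
 *   bit   8    : pending
 *   bits 16-31 : tail
 */
#define Q_LOCKED_MASK  0x000000ffu
#define Q_PENDING_VAL  0x00000100u
#define Q_TAIL_OFFSET  16
#define Q_TAIL_MASK    0xffff0000u

/* Emulates the effect of xchg_tail(): only the tail bits change. */
static uint32_t demo_xchg_tail(_Atomic uint32_t *val, uint16_t new_tail)
{
	uint32_t old = atomic_load_explicit(val, memory_order_relaxed);
	uint32_t next;

	do {
		next = (old & ~Q_TAIL_MASK) |
		       ((uint32_t)new_tail << Q_TAIL_OFFSET);
	} while (!atomic_compare_exchange_weak_explicit(val, &old, next,
							memory_order_relaxed,
							memory_order_relaxed));

	return old >> Q_TAIL_OFFSET;	/* previous tail */
}

/*
 * Hypothetical helper, NOT the interface proposed in [1]: spin until
 * (*val & mask) == 0 and return the value observed, with acquire
 * ordering on the load.
 */
static uint32_t demo_load_cond_mask_acquire(_Atomic uint32_t *val,
					    uint32_t mask)
{
	uint32_t v;

	assert(mask != 0);
	while ((v = atomic_load_explicit(val, memory_order_acquire)) & mask)
		;	/* real code would cpu_relax() or wait for an event */

	return v;
}
```

With hardware support for a masked or byte-sized wait, the waiter in
this sketch would not need to re-evaluate its condition when only the
tail bits change. Whether that gives a measurable gain still has to be
shown.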
>
> Cheers,
> Longman
>
--
Best Regards
Guo Ren