Re: [PATCH v3 next 0/5] locking/osq_lock: Optimisations to osq_lock code

From: David Laight

Date: Fri Mar 06 2026 - 17:59:59 EST


On Fri, 6 Mar 2026 22:51:45 +0000
david.laight.linux@xxxxxxxxx wrote:

Apologies to Yafang for mistyping his address....

> From: David Laight <david.laight.linux@xxxxxxxxx>
>
> This is a slightly edited copy of v2 from 2 years ago.
> I've re-read the comments (on v1 and v2).
> Patch #3 now unconditionally calls decode_cpu() when stabilizing @prev
> (I'm not at all sure the cpu number can ever be unchanged.)
> Patch #5 now converts almost all the cpu numbers to 'unsigned int'.
>
> For patch #2 I've found a note that:
> kernel test robot noticed a 10.7% improvement of stress-ng.netlink-task.ops_per_sec
>
> Notes from v2:
> Patch #1 is the node->locked part of v1's patch #2.
>
> Patch #2 removes the pretty much guaranteed cache line reload getting
> the cpu number (from node->prev) for the vcpu_is_preempted() check.
> It is (basically) the old #5 with the addition of a READ_ONCE()
> and leaving the '+ 1' offset (for patch 3).
>
> Patch #3 ends up removing both node->cpu and node->prev.
> This avoids issues initialising node->cpu.
> Basically node->cpu was only ever read as node->prev->cpu in the unqueue code.
> Most of the time it is the value read from lock->tail that was used to
> obtain 'prev' in the first place.
> The only time it is different is in the unlock race path where 'prev'
> is re-read from node->prev - updated right at the bottom of osq_lock().
> So the updated node->prev_cpu can be used (and prev obtained from it) without
> worrying about only one of node->prev and node->prev_cpu being updated.
>
> Linus did suggest just saving the cpu numbers instead of pointers.
> It actually works for 'prev' but not 'next'.
>
> Patch #4 removes the unnecessary node->next = NULL
> assignment from the top of osq_lock().
>
> Patch #5 just stops gcc using two separate instructions to decrement
> the offset cpu number and then convert it to 64 bits.
> Linus got annoyed with it, and I'd spotted it as well.
> I don't seem to be able to get gcc to convert __per_cpu_offset[cpu - 1]
> to (__per_cpu_offset - 1)[cpu] (cpu is offset by one) but, in any case,
> it would still need zero extending in the common case.
>
> David Laight (5):
> Defer clearing node->locked until the slow osq_lock() path.
> Optimise vcpu_is_preempted() check.
> Use node->prev_cpu instead of saving node->prev.
> Optimise decode_cpu() and per_cpu_ptr().
> Avoid writing to node->next in the osq_lock() fast path.
>
> kernel/locking/osq_lock.c | 56 +++++++++++++++++++--------------------
> 1 file changed, 27 insertions(+), 29 deletions(-)
>