Re: [PATCH v4 3/4] locking/qspinlock: Add ARCH_USE_QUEUED_SPINLOCKS_XCHG32

From: Arnd Bergmann
Date: Mon Mar 29 2021 - 05:42:28 EST


On Mon, Mar 29, 2021 at 9:52 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Sat, Mar 27, 2021 at 06:06:38PM +0000, guoren@xxxxxxxxxx wrote:
> > From: Guo Ren <guoren@xxxxxxxxxxxxxxxxx>
> >
> > Some architectures don't have sub-word swap atomic instruction,
> > they only have the full word's one.
> >
> > The sub-word swap only improve the performance when:
> > NR_CPUS < 16K
> > * 0- 7: locked byte
> > * 8: pending
> > * 9-15: not used
> > * 16-17: tail index
> > * 18-31: tail cpu (+1)
> >
> > The 9-15 bits are wasted to use xchg16 in xchg_tail.
> >
> > Please let architecture select xchg16/xchg32 to implement
> > xchg_tail.
>
> So I really don't like this, this pushes complexity into the generic
> code for something that's really not needed.
>
> Lots of RISC already implement sub-word atomics using word ll/sc.
> Obviously they're not sharing code like they should be :/ See for
> example arch/mips/kernel/cmpxchg.c.

That is what the previous version of the patch set did, right?

I think this v4 is nicer because the code is already there in
qspinlock.c and just gets moved around, and the implementation
is likely more efficient this way. The mips version could be made
more generic, but it is also less efficient than a simple xchg
since it requires an indirect function call plus nesting a pair of
loops instead in place of the single single ll/sc loop in the 32-bit
xchg.

I think the weakly typed xchg/cmpxchg implementation causes
more problems than it solves, and we'd be better off using
a stronger version in general, with the 8-bit and 16-bit exchanges
using separate helpers in the same way that the fixed-length
cmpxchg64 is separate already, there are only a couple of instances
for each of these in the kernel.

Unfortunately, there is roughly a 50:50 split between fixed 32-bit
and long/pointer-sized xchg/cmpxchg users in the kernel, so
making the interface completely fixed-type would add a ton of
churn. I created an experimental patch for this, but it's probably
not worth it.

> Also, I really do think doing ticket locks first is a far more sensible
> step.

I think this is the important point to look at first. From what I can
tell, most users of ticket spinlocks have moved to qspinlock over
time, but I'm not sure about the exact tradeoff, in particular on
large machines without a 16-bit xchg operation.

FWIW, the implementations in the other SMP capable
architectures are

alpha simple
arc simple+custom
arm ticket
arm64 qspinlock (formerly ticket)
csky ticket/qrwlock (formerly simple+ticket/qrwlock)
hexagon simple
ia64 ticket
mips qspinlock (formerly ticket)
openrisc qspinlock
parisc custom
powerpc simple+qspinlock (formerly simple)
riscv simple
s390 custom-queue
sh simple
sparc qspinlock
x86 qspinlock (formerly ticket+qspinlock)
xtensa qspinlock

32-bit Arm is the only relevant user of ticket locks these days, but its
hardware has a practical limit of 16 CPUs and four nodes, with most
implementations having a single CPU cluster of at most four cores.

We had the same discussion about xchg16() when Sebastian submitted
the arm32 qspinlock implementation:
https://lore.kernel.org/linux-arm-kernel/20191007214439.27891-1-sebastian@xxxxxxxxxxxxx/

Arnd