Re: [PATCH v3 0/9] s390: Improve this_cpu operations

From: David Laight

Date: Thu May 28 2026 - 13:14:55 EST

On Thu, 28 May 2026 16:14:41 +0200
Heiko Carstens <hca@xxxxxxxxxxxxx> wrote:

> On Wed, May 27, 2026 at 04:44:31PM -0700, Yang Shi wrote:
> > On 5/22/26 2:18 AM, Heiko Carstens wrote:
> > > It is amazing to see the performance improvements you see on arm64, however
> > > I believe that is mainly because of the large amount of code which is
> > > generated by the arm64 implementations of the preempt primitives
> > > __preempt_count_add() and __preempt_count_dec_and_test().
> >
> > Yes, we need 4 instructions on ARM64 for disabling/enabling preempt (one
> > instruction is used to load current pointer, the other 3 instructions are
> > used to RMW preempt_count). So I can remove 8 instructions in total for a
> > single this_cpu ops. That's a lot. Given this_cpu ops are heavily used in
> > kernel, we end up running fewer instructions and having better icache hit
> > rate, the better icache hit rate also helps reduce cross node traffic for
> > 2-socket system.
>
> You save more. Look at arm64's __preempt_count_dec_and_test()
> implementation: it is RMW + compare + READ + compare.
>
> preempt_enable() generates this code, where x1 seems to contain the
> preempt_count pointer:
>
> 80: f9400420 ldr x0, [x1, #8]
> 84: d1000400 sub x0, x0, #0x1
> 88: b9000820 str w0, [x1, #8]
> 8c: b4000060 cbz x0, 98 <bar+0x58>
> 90: f9400420 ldr x0, [x1, #8]
> 94: b5000040 cbnz x0, 9c <bar+0x5c>
> 98: 94000000 bl 0 <preempt_schedule_notrace>
> 9c: ...
>
> I assume arm64's instruction set does not allow for better code for
> __preempt_count_dec_and_test() if you would fold the need_resched bit into
> preempt_count and use atomic instructions + inline assembly with flag
> output operands when modifying preempt_count.
> As of now only x86 and s390 are doing that.

I think arm64 only has single instruction exchanges - which makes life hard.
But it has to be possible to do better than the above.
The 'normal' path (not nested, no preemption) seems to execute everything
except the 'bl'.
All the 'not preempted' paths have a taken forwards conditional branch
that stands a fair chance of being mispredicted.
There is also the 32bit write followed by a 64bit read of the same address.
That will 'break' any logic that does 'store to load' forwarding (where
the read is satisfied from the store buffer) and add more delays.
That means I think you need something like:
ldr w0, [x1, #8]
sub x0, x0, #1
str w0, [x1, #8]
ldr w2, [x1, #12]
or x0, x0, x2
cbz x0, 1f
2:
# sometime later.
1:
bl preempt_schedule:
b 2b

But the last arm system I wrote asm for was a strongarm!
And the book I have is from 2004.

The definition:
#define preempt_enable() \
do { \
barrier(); \
if (unlikely(preempt_count_dec_and_test())) \
__preempt_schedule(); \
} while (0)
doesn't really help.
gcc tends to ignore the unlikely() when the other path is empty
and just generates a forwards branch around the call.
Forcing it to generate both parts of the if can help.
So:
#define preempt_enable() \
do { \
barrier(); \
if (unlikely(preempt_count_dec_and_test())) \
__preempt_schedule(); \
else \
asm (""); \
} while (0)
can be enough to force a conditional branch to the call.

-- David