Re: [PATCH v3 0/9] s390: Improve this_cpu operations

From: Yang Shi

Date: Thu May 28 2026 - 14:46:14 EST

On 5/28/26 7:14 AM, Heiko Carstens wrote:

On Wed, May 27, 2026 at 04:44:31PM -0700, Yang Shi wrote:

On 5/22/26 2:18 AM, Heiko Carstens wrote:

It is amazing to see the performance improvements you see on arm64, however
I believe that is mainly because of the large amount of code which is
generated by the arm64 implementations of the preempt primitives
__preempt_count_add() and __preempt_count_dec_and_test().

Yes, we need 4 instructions on ARM64 to disable or enable preemption (one
instruction loads the current pointer, the other 3 instructions RMW
preempt_count). So I can remove 8 instructions in total for a single
this_cpu op. That's a lot. Given that this_cpu ops are heavily used in the
kernel, we end up running fewer instructions and getting a better icache
hit rate; the better icache hit rate also helps reduce cross-node traffic
on 2-socket systems.

You save more. Look at arm64's __preempt_count_dec_and_test()
implementation: it is RMW + compare + READ + compare.

Yes

preempt_enable() generates this code, where x1 seems to contain the
preempt_count pointer:

  80:   f9400420    ldr   x0, [x1, #8]
  84:   d1000400    sub   x0, x0, #0x1
  88:   b9000820    str   w0, [x1, #8]
  8c:   b4000060    cbz   x0, 98 <bar+0x58>
  90:   f9400420    ldr   x0, [x1, #8]
  94:   b5000040    cbnz  x0, 9c <bar+0x5c>
  98:   94000000    bl    0 <preempt_schedule_notrace>
  9c:   ...

I assume arm64's instruction set does not allow for better code for
__preempt_count_dec_and_test() even if you folded the need_resched bit
into preempt_count and used atomic instructions + inline assembly with
flag output operands when modifying preempt_count. As of now only x86 and
s390 are doing that.
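
For illustration only, the folded variant described above can be modeled in
user space with C11 atomics. This is a rough sketch, not kernel code: the
names model_folded_dec_and_test and MODEL_NEED_RESCHED are hypothetical, the
flag is assumed to be inverted (set means no reschedule is needed), and a
plain atomic_fetch_sub stands in for inline assembly with flag output
operands.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical layout: the top bit of a single 32-bit word carries the
 * inverted need_resched flag (set = no reschedule needed), the remaining
 * bits carry the preempt count.
 */
#define MODEL_NEED_RESCHED 0x80000000u

/*
 * One atomic RMW replaces the load/sub/store/compare/reload/compare
 * sequence: the new value is zero only if the count dropped to zero AND
 * the inverted flag was already clear (a reschedule is pending).
 */
static bool model_folded_dec_and_test(_Atomic uint32_t *pc)
{
	/* fetch_sub returns the old value; the new value is zero iff old == 1 */
	return atomic_fetch_sub_explicit(pc, 1, memory_order_relaxed) == 1;
}
```

Whether such a single RMW is actually cheaper than the non-atomic sequence
depends on the architecture, which is the question raised in this thread.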

preempt_count and need_resched share the same 8 bytes: preempt_count is the
lower 32 bits and need_resched is the upper 32 bits.
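
To make the layout concrete, here is a minimal user-space model of the
sequence in the disassembly above (64-bit load, decrement, 32-bit store,
compare, reload, compare). It is only a sketch: struct model_thread_info and
the model_* helpers are hypothetical names, plain accesses stand in for
READ_ONCE/WRITE_ONCE, and it assumes, as the cbz on the full x0 register
suggests, that an upper word of zero means a reschedule is pending.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, not the kernel's definition. */
struct model_thread_info {
	/* low 32 bits: preempt count, high 32 bits: need_resched word */
	uint64_t preempt_count;
};

/* Models "str w0, [x1, #8]": update the count, keep the upper word. */
static void model_store_count(struct model_thread_info *ti, uint32_t count)
{
	ti->preempt_count =
		(ti->preempt_count & 0xffffffff00000000ULL) | count;
}

static bool model_preempt_count_dec_and_test(struct model_thread_info *ti)
{
	uint64_t pc = ti->preempt_count;	/* ldr x0, [x1, #8] */

	pc--;					/* sub x0, x0, #0x1 */
	model_store_count(ti, (uint32_t)pc);	/* str w0, [x1, #8] */

	/*
	 * cbz x0: all 64 bits zero means count == 0 and a reschedule is
	 * pending. Otherwise reload (second ldr) and test again (cbnz).
	 */
	return !pc || !ti->preempt_count;
}
```

In this single-threaded model the reload can never differ from the first
check; in the kernel it presumably covers an interrupt changing the upper
word between the load and the store.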

Atomic instructions are usually slower than load + add + store on ARM64
when the cache line is not contended. We might save one branch + load, but
my profiling didn't show the branch to be a major contributing factor. The
performance gain mainly comes from executing fewer instructions and from
the improved icache hit rate due to eliminating preempt_disable/enable.

Thanks,
Yang