Re: [PATCH v3 0/9] s390: Improve this_cpu operations
From: Heiko Carstens
Date: Thu May 28 2026 - 10:16:29 EST
On Wed, May 27, 2026 at 04:44:31PM -0700, Yang Shi wrote:
> On 5/22/26 2:18 AM, Heiko Carstens wrote:
> > It is amazing to see the performance improvements you see on arm64, however
> > I believe that is mainly because of the large amount of code which is
> > generated by the arm64 implementations of the preempt primitives
> > __preempt_count_add() and __preempt_count_dec_and_test().
>
> Yes, we need 4 instructions on ARM64 for disabling/enabling preempt (one
> instruction is used to load current pointer, the other 3 instructions are
> used to RMW preempt_count). So I can remove 8 instructions in total for a
> single this_cpu ops. That's a lot. Given this_cpu ops are heavily used in
> kernel, we end up running fewer instructions and having better icache hit
> rate, the better icache hit rate also helps reduce cross node traffic for
> 2-socket system.
You save more. Look at arm64's __preempt_count_dec_and_test()
implementation: it is RMW + compare + READ + compare.
preempt_enable() generates this code, where x1 seems to contain the
preempt_count pointer:
80: f9400420 ldr x0, [x1, #8]
84: d1000400 sub x0, x0, #0x1
88: b9000820 str w0, [x1, #8]
8c: b4000060 cbz x0, 98 <bar+0x58>
90: f9400420 ldr x0, [x1, #8]
94: b5000040 cbnz x0, 9c <bar+0x5c>
98: 94000000 bl 0 <preempt_schedule_notrace>
9c: ...
I assume arm64's instruction set does not allow for better code for
__preempt_count_dec_and_test() if you would fold the need_resched bit into
preempt_count and use atomic instructions + inline assembly with flag
output operands when modifying preempt_count.
As of now only x86 and s390 are doing that.