Re: [PATCH v3 0/9] s390: Improve this_cpu operations

From: David Laight

Date: Thu May 28 2026 - 05:17:08 EST


On Wed, 27 May 2026 16:44:31 -0700
Yang Shi <yang@xxxxxxxxxxxxxxxxxxxxxx> wrote:

> On 5/22/26 2:18 AM, Heiko Carstens wrote:
...
> > It is amazing to see the performance improvements you see on arm64, however
> > I believe that is mainly because of the large amount of code which is
> > generated by the arm64 implementations of the preempt primitives
> > __preempt_count_add() and __preempt_count_dec_and_test().
>
> Yes, we need 4 instructions on ARM64 for disabling/enabling preempt (one
> instruction is used to load current pointer, the other 3 instructions
> are used to RMW preempt_count). So I can remove 8 instructions in total
> for a single this_cpu ops. That's a lot. Given this_cpu ops are heavily
> used in kernel, we end up running fewer instructions and having better
> icache hit rate, the better icache hit rate also helps reduce cross node
> traffic for 2-socket system.

Is 'current' kept in a CPU hardware register, with the process switch
code updating current->per_cpu_data?

That might mean that you can access per-cpu data without disabling
preemption (for single ops) using the same technique as s390.
So something like:
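	mov  %ra, current              // %ra = current
	movb per_cpu_reg(%ra), $b      // current->per_cpu_reg = b, the number of %rb
	mov  %rb, per_cpu_data(%ra)    // %rb = current->per_cpu_data
	// per-cpu access using %rb, process switch code will update %rb
	movb per_cpu_reg(%ra), $255    // current->per_cpu_reg = 255, i.e. no register to update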
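
A rough userspace C model of that sequence, only to make the intended
fix-up visible. Everything here is an assumption for illustration: the
struct layout, the simulated register file, and the names switch_to_cpu()
and this_cpu_write_sim() are hypothetical and are not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define NO_PER_CPU_REG 255   /* marker: no register holds a per-cpu pointer */
#define NUM_REGS 32
#define NUM_CPUS 4

/* Hypothetical per-task state mirroring per_cpu_reg/per_cpu_data above. */
struct task {
	uint8_t per_cpu_reg;     /* register number to fix up, or NO_PER_CPU_REG */
	long *per_cpu_data;      /* per-cpu base of the CPU the task runs on */
	long *regs[NUM_REGS];    /* simulated register file */
};

static long cpu_area[NUM_CPUS];  /* one variable per simulated CPU */

/* What the process switch code would do when the task lands on new_cpu. */
static void switch_to_cpu(struct task *t, int new_cpu)
{
	assert(new_cpu >= 0 && new_cpu < NUM_CPUS);
	t->per_cpu_data = &cpu_area[new_cpu];
	if (t->per_cpu_reg != NO_PER_CPU_REG) {
		assert(t->per_cpu_reg < NUM_REGS);
		/* Task was inside an access window: repoint its register. */
		t->regs[t->per_cpu_reg] = t->per_cpu_data;
	}
}

/*
 * One per-cpu store following the sequence above. A migrate_to >= 0
 * simulates preemption plus migration in the middle of the window.
 */
static void this_cpu_write_sim(struct task *t, uint8_t rb, long val,
			       int migrate_to)
{
	assert(rb < NUM_REGS);
	t->per_cpu_reg = rb;                 /* movb per_cpu_reg(%ra), $b */
	t->regs[rb] = t->per_cpu_data;       /* mov %rb, per_cpu_data(%ra) */
	if (migrate_to >= 0)
		switch_to_cpu(t, migrate_to);
	*t->regs[rb] = val;                  /* per-cpu access using %rb */
	t->per_cpu_reg = NO_PER_CPU_REG;     /* movb per_cpu_reg(%ra), $255 */
}
```

In this model a task that starts on CPU 0 and is migrated to CPU 2 inside
the window ends up writing CPU 2's variable, since the switch code has
repointed the register before the store executes.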

An add will need to use a cmpxchg loop.
For simplicity use a fixed register for %rb.
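
The cmpxchg loop for an add could look roughly like the sketch below,
written with C11 atomics rather than kernel primitives. The helper name
cmpxchg_add() is made up for this example. The loop is presumably needed
because a separate load, add and store could be preempted between the
steps, so the store might overwrite another task's update or land on a
different CPU's variable with a stale value.

```c
#include <stdatomic.h>

/*
 * Hypothetical helper: add delta to *ctr using a compare-and-exchange
 * retry loop. Returns the new value.
 */
static long cmpxchg_add(_Atomic long *ctr, long delta)
{
	long old = atomic_load_explicit(ctr, memory_order_relaxed);

	/* On failure 'old' is refreshed with the current value; retry. */
	while (!atomic_compare_exchange_weak_explicit(ctr, &old, old + delta,
						      memory_order_relaxed,
						      memory_order_relaxed))
		;

	return old + delta;
}
```

If the task is moved between the load and the compare-and-exchange, the
compare runs against the new location and either fails and retries, or
succeeds only when the value there happens to equal 'old', in which case
the result is still correct for that location.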

-- David