Re: [PATCH v3 0/9] s390: Improve this_cpu operations

From: David Laight

Date: Thu May 28 2026 - 16:41:04 EST


On Thu, 28 May 2026 12:19:43 -0700
Yang Shi <yang@xxxxxxxxxxxxxxxxxxxxxx> wrote:

> On 5/28/26 2:03 AM, David Laight wrote:
> > On Wed, 27 May 2026 16:44:31 -0700
> > Yang Shi <yang@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> >> On 5/22/26 2:18 AM, Heiko Carstens wrote:
> > ...
> >>> It is amazing to see the performance improvements you see on arm64, however
> >>> I believe that is mainly because of the large amount of code which is
> >>> generated by the arm64 implementations of the preempt primitives
> >>> __preempt_count_add() and __preempt_count_dec_and_test().
> >> Yes, we need 4 instructions on ARM64 for disabling/enabling preempt (one
> >> instruction is used to load current pointer, the other 3 instructions
> >> are used to RMW preempt_count). So I can remove 8 instructions in total
> >> for a single this_cpu ops. That's a lot. Given this_cpu ops are heavily
> >> used in kernel, we end up running fewer instructions and having better
> >> icache hit rate, the better icache hit rate also helps reduce cross node
> >> traffic for 2-socket system.
> > Is 'current' kept in a cpu hardware register?
>
> Yes, sp_el0. But it is a special register, we need to move it to a general
> register before any ARM64 instructions can access it.

That is what I thought.
(Hmm... isn't that the userspace stack register?)

>
> > With the process switch code updating current->per_cpu_data.
> >
> > That might mean that you can access per-cpu data without disabling
> > preemption (for single ops) using the same technique as s390.
> > So something like:
> > mov %ra, current
> > movb per_cpu_reg(%ra), $b
> > mov %rb, per_cpu_data(%ra)
> > // per-cpu access using %rb, process switch code will update %rb
> > movb per_cpu_reg(%ra), $255
> >
> > An add will need to use a cmpxchg loop.
> > For simplicity use a fixed register for %rb.
>
> TBH, I can't say I fully understand what you proposed. But it sounds
> like this one
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?id=84ee5f23f93d4a650e828f831da9ed29c54623c5

Not really, although it does describe one way to do an atomic add.
For things like per-cpu stats you don't really care if the
'wrong' stats are changed, but the R and W (of the RMW) need to go to the
same address.
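
A minimal sketch of the "R and W go to the same address" requirement,
written as portable C11 rather than arm64 code. All names here
(stats, percpu_ptr, percpu_add) are made up for illustration, and
percpu_ptr is only a stand-in for a register that the context switch
code might rewrite at any point:

```c
#include <assert.h>
#include <stdatomic.h>

static _Atomic long stats[2];	/* pretend per-cpu counters for cpu 0 and 1 */

/* Hypothetical stand-in for the register holding the per-cpu address. */
static _Atomic long *volatile percpu_ptr = &stats[0];

static void percpu_add(long delta)
{
	long old;

	assert(percpu_ptr != 0);
	old = atomic_load_explicit(percpu_ptr, memory_order_relaxed);

	/*
	 * Each compare-exchange reads and writes a single address. If
	 * percpu_ptr changed after the load, the compare most likely
	 * fails, 'old' is refreshed from the address that was actually
	 * hit, and the loop retries. If it succeeds anyway, the target
	 * held exactly 'old', so the add is still correct for that counter.
	 */
	while (!atomic_compare_exchange_weak_explicit(percpu_ptr, &old,
						      old + delta,
						      memory_order_relaxed,
						      memory_order_relaxed))
		;
}
```

A plain load/add/store through percpu_ptr would not have this
property: the load could hit one cpu's counter and the store another's.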

That proposal reserved a 'general register' for the per-cpu data all the time.

Like the s390 code this all started with, I'm suggesting that the code
tells the context switch code that a specific register contains the base
of the per-cpu data. On context switch that register is changed to be the
base address of the per-cpu data for the new cpu.
So outside of the code accessing per-cpu data the register can be used normally.
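
As a rough userspace model of that idea (not kernel code, and every
name below is hypothetical: struct task, per_cpu_reg, switch_to_cpu(),
the choice of register 9), the bookkeeping could look something like
this. The comments map each step back to the pseudo-asm quoted above:

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS			2
#define NR_REGS			16
#define PER_CPU_REG_NONE	255	/* no register holds a per-cpu base */
#define PER_CPU_FIXED_REG	9	/* the fixed register, picked arbitrarily */

struct task {
	uint8_t   per_cpu_reg;		/* register index, or PER_CPU_REG_NONE */
	uintptr_t regs[NR_REGS];	/* saved general registers */
};

static long counters[NR_CPUS];		/* one counter per cpu */

static uintptr_t per_cpu_base(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (uintptr_t)&counters[cpu];
}

/* Model of resuming 'tsk' on 'new_cpu'. */
static void switch_to_cpu(struct task *tsk, int new_cpu)
{
	if (tsk->per_cpu_reg != PER_CPU_REG_NONE)
		tsk->regs[tsk->per_cpu_reg] = per_cpu_base(new_cpu);
}

/* Increment a per-cpu counter while being migrated from cpu 0 to cpu 1. */
static long example(void)
{
	struct task t = { .per_cpu_reg = PER_CPU_REG_NONE };

	t.per_cpu_reg = PER_CPU_FIXED_REG;		/* movb per_cpu_reg(%ra), $b */
	t.regs[PER_CPU_FIXED_REG] = per_cpu_base(0);	/* mov %rb, per_cpu_data(%ra) */

	switch_to_cpu(&t, 1);				/* preempted and migrated */

	*(long *)t.regs[PER_CPU_FIXED_REG] += 1;	/* the access, through %rb */

	t.per_cpu_reg = PER_CPU_REG_NONE;		/* movb per_cpu_reg(%ra), $255 */

	t.regs[PER_CPU_FIXED_REG] = 0x1234;		/* register reused normally */
	switch_to_cpu(&t, 0);				/* and no longer rewritten */
	assert(t.regs[PER_CPU_FIXED_REG] == 0x1234);

	return counters[1] - counters[0];
}
```

The increment lands on cpu 1's counter because the switch rewrote the
register before the access. The '+= 1' here stands in for a single
access; the model does not cover preemption in the middle of a
read-modify-write.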

I don't think you need to look at the opcode in the process switch (the s390
code did); even checking that %rb (above) contains the per-cpu data address
is really optional.

I suggested using a fixed register, meaning 'always use the same register',
to save the difficulty of generating $n from %rn.

-- David

>
> Thanks,
> Yang
>
> >
> > -- David
>