Re: [PATCH v3 0/9] s390: Improve this_cpu operations

From: Yang Shi

Date: Wed May 20 2026 - 20:24:01 EST

On 5/20/26 3:34 PM, David Laight wrote:

On Wed, 20 May 2026 11:42:36 -0700
Yang Shi <yang@xxxxxxxxxxxxxxxxxxxxxx> wrote:

Hi Heiko,

Thanks for cc'ing me the patchset. Please see the below inline comments.

On 5/20/26 2:22 AM, Heiko Carstens wrote:

v3:
- Fix various typos [Juergen Christ]

- Add missing kprobe detection / handling [Sashiko [3]]
[FWIW, this made me also aware of that the current general s390 kprobes
code seems to be racy against concurrent removal of a kprobe while a
probe hit on a different CPU. But that is a different story.]

- Fix various minor findings [Sashiko [3]]

- All of this might be dropped / exchanged in future in favor of the percpu
page table approach proposed by Yang Shi [4].

Thanks for mentioning my approach. I will do some comparison with rseq
in the following design details section of the cover letter.

[3] https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca@xxxxxxxxxxxxx
[4] https://lore.kernel.org/all/20260429170758.3018959-1-yang@xxxxxxxxxxxxxxxxxxxxxx/

v2:

- Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
warnings

- Add missing __packed attribute to insn structure [Sashiko [2]]

- Fix inverted if condition [Sashiko [2]]

- Add missing user_mode() check [Sashiko [2]]

- Move percpu_entry() call in front of irqentry_enter() call in all
entry paths to avoid that potential this_cpu() operations overwrite
the not-yet saved percpu code section indicator [Sashiko [2]]

[2] https://sashiko.dev/#/patchset/20260317195436.2276810-1-hca%40linux.ibm.com

v1:

This is a follow-up to Peter Zijlstra's in-kernel rseq RFC [1].

With the intended removal of PREEMPT_NONE this_cpu operations based on
atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
become more expensive: the preempt_disable() / preempt_enable() pairs are
not optimized away anymore during compile time.

In particular the conditional call to preempt_schedule_notrace() after
preempt_enable() adds additional code and register pressure.

To avoid this Peter suggested an in-kernel rseq approach. While this would
certainly work, this series tries to come up with a solution which uses
less instructions and doesn't require to repeat instruction sequences.

The idea is that this_cpu operations based on atomic instructions are
guarded with mvyi instructions:

- The first mvyi instruction writes the register number, which contains
the percpu address variable to lowcore. This also indicates that a
percpu code section is executed.

- The first instruction following the mvyi instruction must be the ag
instruction which adds the percpu offset to the percpu address register.

- Afterwards the atomic percpu operation follows.

- Then a second mvyi instruction writes a zero to lowcore, which indicates
the end of the percpu code section.

- In case of an interrupt/exception/nmi the register number which was
written to lowcore is copied to the exception frame (pt_regs), and a zero
is written to lowcore.

- On return to the previous context it is checked if a percpu code section
was executed (saved register number not zero), and if the process was
migrated to a different cpu. If the percpu offset was already added to
the percpu address register (instruction address does _not_ point to the
ag instruction) the content of the percpu address register is adjusted so
it points to percpu variable of the new cpu.

If I understand correctly, you replaced preempt_disable() and
preempt_enable() with seq begin and seg end, and seq begin and seq end
can be optimized by mvyi instruction on S390. So you just need a single
mvyi instruction for each instead of read-modify-write the seq count.

But you need some extra overhead for context switch (save and restore
the seq count register) and need to check whether it is still on the
same cpu once resuming execution. And there is also penalty if it is
migrated to another CPU (need to rerun this_cpu ops).

Not as I understand it.
What happens is the context switch code 'corrupts' the register being
used to access per-cpu data so that it is correct for the new cpu.
The write of zero after the sequence is there to stop the register
being corrupted outside of this code window.

Thanks for elaborating it. I misunderstood some nuance. I read the patch #2 commit message, now I think I understand how it works.

Borrowed the disassemble from patch #2 commit message:

11a8e6: c0 30 00 d0 c5 0d larl %r3,1b33300
11a8ec: b9 04 00 43 lgr %r4,%r3
11a8f0: eb 00 43 c0 00 52 mviy 960,4
11a8f6: e3 40 03 b8 00 08 ag %r4,952
11a8fc: eb 52 40 00 00 e8 laag %r5,%r2,0(%r4)
11a902: eb 00 03 c0 00 52 mviy 960,0
11a908: b9 08 00 25 agr %r2,%r5
11a90c 07 fe br %r14

11a8f0 loads the percpu offset and mark the percpu code section begin, I believe this is needed with percpu page table too because we need load local percpu offset.
11a920 loads 0 to the register to mark the percpu code section end, this is not needed with percpu page table.

And you need to save the register at the irq/exception entry, then restore it at exit. But you also need to check whether migration happens or not, if it happens kernel needs to rewrite the register with correct percpu offset and needs to check whether the interrupted instruction is "ag". If it is "ag" instruction (11a8f6) , kernel needs to recalculate the percpu address, right?

It sounds a little bit hacky to me TBH and incur some extra overhead for "migration detection" and fixup.

This really just means that you can (mostly) only do single accesses,
since nothing stops pre-emption between the RW or an RMW sequence.
Although you can probably do an increment of the preempt disable count
because if you are preempted the value read will be zero.

So it seems have more overhead than the percpu page table approach IIUC.
We don't need all the steps with percpu page table. And there is no
penalty for migration.

This code looks like it relies on 'page zero' already being percpu.
So it probably isn't really that different.
Some values like the 'preemption disable count' and 'current' could be
(maybe are?) written into page zero to give fast access.

I don't quite get what you mean about 'page zero'.

But I'm sure I remember that some cpu don't like having the same
physical address at different virtual addresses (and not just those
with VIVT caches like some sparc cpu).

Yeah, VIVT cache doesn't like it due to cache alias. But the mapping is really percpu, so the mapping to the physical address belonging to another CPU should never pollute the current CPU's cache if I don't miss something.

I'm sure code can end up accessing the current cpu's percpu data
using the same address that other cpu use - there are definitely
places where it needs that address.

No, it is not. In the percpu page table approach, we use different address for this_cpu_*() and per_cpu_ptr() which is mainly used to initialize percpu data for all CPUs.

Thanks,
Yang

On x86-64 that means it reading the address from the array rather
than just offsetting from %gs.

-- David

All of this seems to work, but of course it could still be broken since I
missed some detail.

In total this series results in a kernel text size reduction of ~106kb. The
number of preempt_schedule_notrace() call sites is reduced from 7089 to
1577.

Yeah, both approaches can reduce the number of
preempt_schedule_notrace() call sites. And both approaches can reduce
the number of non-preemptible critical sections.

Note: this comes without any huge performance analysis, however all
microbenchmarks confirmed that the new code is at least as fast as the
old code, like expected.

I'm really interested in the benchmark number. I'm supposed percpu page
table approach should have better performance per my above analysis.

Christopher Lameter is also interested in it, cc'ed him too.

Thanks,
Yang

[1] 20260223163843.GR1282955@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Heiko Carstens (9):
s390/alternatives: Add new ALT_TYPE_PERCPU type
s390/percpu: Infrastructure for more efficient this_cpu operations
s390/percpu: Add missing do { } while (0) constructs
s390/percpu: Use new percpu code section for arch_this_cpu_add()
s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
s390/percpu: Provide arch_this_cpu_read() implementation
s390/percpu: Provide arch_this_cpu_write() implementation
s390/percpu: Remove one and two byte this_cpu operation implementation

arch/s390/boot/alternative.c | 7 +
arch/s390/include/asm/alternative.h | 5 +
arch/s390/include/asm/entry-percpu.h | 76 ++++++++
arch/s390/include/asm/lowcore.h | 3 +-
arch/s390/include/asm/percpu.h | 249 +++++++++++++++++++++------
arch/s390/include/asm/ptrace.h | 2 +
arch/s390/kernel/alternative.c | 25 ++-
arch/s390/kernel/irq.c | 26 ++-
arch/s390/kernel/nmi.c | 6 +
arch/s390/kernel/traps.c | 6 +
10 files changed, 344 insertions(+), 61 deletions(-)
create mode 100644 arch/s390/include/asm/entry-percpu.h

base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8