Re: [PATCH v3 0/9] s390: Improve this_cpu operations

From: Yang Shi

Date: Wed May 27 2026 - 19:44:48 EST

On 5/22/26 2:18 AM, Heiko Carstens wrote:

On Thu, May 21, 2026 at 10:47:49AM -0700, Yang Shi wrote:

As background: s390 has so-called prefix pages; the first two pages of every
CPU are percpu, via a special prefixing mechanism. Parts of the pages can be
used by operating systems as percpu data area, which we use to have fast
access to e.g. the 'current' pointer, the pid, percpu_offset of the current
cpu, etc.

It also helps that for instructions which access memory with base register
zero, the register's contents are assumed to be zero for address generation by
the hardware, regardless of its real contents. That is, the above

ag %r4,952

is the short version of

ag %r4,952(%r0)

The eight bytes at offset 952 of the current CPU's prefix page are added to
register 4. The real contents of register 0 are irrelevant for such address
generation, which reduces register pressure.
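
To make the rule concrete, here is a minimal sketch in C of base plus displacement address generation as described above. The function name and the regs[] array are hypothetical, and the model is simplified: it ignores the index register and treats the displacement as unsigned.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of base + displacement address generation.
 * regs[] stands in for the 16 general purpose registers.
 */
static uint64_t effective_address(const uint64_t regs[16], unsigned int base,
				  uint64_t disp)
{
	uint64_t base_val;

	assert(base < 16);
	/*
	 * Base register number 0 means "no base": it contributes zero,
	 * whatever %r0 currently holds.
	 */
	base_val = (base == 0) ? 0 : regs[base];
	return base_val + disp;
}
```

With base 0 and displacement 952 the result is always 952, so "ag %r4,952" reaches offset 952 of the prefix page without needing a register that holds zero.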

Aha, I see. So the prefix pages are some special memory?

No, it is regular memory. The CPU has a special "prefix register". If
that is set to an address not equal to zero, all memory accesses to the
first two pages will be transparently redirected to the 8k memory area
specified with that register.

E.g. the prefix register contains the value 0x10000. If a memory
access to address 0x400 then happens, the CPU will transparently turn that
into a memory access to address 0x10400. Or in other words, that is a
small per cpu memory area mechanism provided by the architecture.
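
The redirection can be sketched in C as follows. This is only an illustration of the example above: the helper name and the constant are made up, and it models nothing beyond the forward redirection of the first 8k described in this mail.

```c
#include <assert.h>
#include <stdint.h>

#define PREFIX_AREA_SIZE 0x2000ULL	/* two 4k pages = 8k */

/* Hypothetical helper: models only the redirection described above. */
static uint64_t apply_prefix(uint64_t addr, uint64_t prefix)
{
	/* Assumption of this sketch: the prefix area is 8k aligned. */
	assert((prefix & (PREFIX_AREA_SIZE - 1)) == 0);

	/* Accesses to the first two pages go to the per-CPU area. */
	if (prefix != 0 && addr < PREFIX_AREA_SIZE)
		return prefix + addr;

	/* Everything else is left alone in this simplified model. */
	return addr;
}
```

With prefix 0x10000, apply_prefix(0x400, 0x10000) gives 0x10400, matching the example above. Each CPU has its own prefix value, so the same low address lands in a different 8k area per CPU.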

Got it.

11a8e6: c0 30 00 d0 c5 0d larl %r3,1b33300
11a8ec: b9 04 00 43 lgr %r4,%r3
11a8f0: eb 00 43 c0 00 52 mviy 960,4
11a8f6: e3 40 03 b8 00 08 ag %r4,952
11a8fc: eb 52 40 00 00 e8 laag %r5,%r2,0(%r4)
11a902: eb 00 03 c0 00 52 mviy 960,0
11a908: b9 08 00 25 agr %r2,%r5
11a90c: 07 fe br %r14

...

11a920 loads 0 to the register to mark the percpu code section end, this is
not needed with percpu page table.

I guess you meant 11a902. But yes, this marks the end of the percpu code
section. Just that this is not a register, but a memory location that is
written to.

So both mviy instructions actually do memory stores?

Yes.
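
For readers following the listing, here is a rough C model of the straight-line sequence. It is a sketch, not the kernel code: the struct, field and function names are hypothetical, memory is a plain byte array, and the interrupt-side detection and fixup are not modeled. The offsets 952 and 960 are the ones from the disassembly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the two prefix page fields used in the listing. */
struct lowcore_model {
	uint64_t percpu_offset;	/* offset 952 in the listing */
	uint8_t marker;		/* offset 960 in the listing */
};

/*
 * Model of the listed sequence. "mem" is a byte array standing in for
 * memory; "var" is the variable's address before the per-CPU offset is
 * added.
 */
static uint64_t this_cpu_add_return_model(struct lowcore_model *lc,
					  unsigned char *mem, uint64_t var,
					  uint64_t val)
{
	uint64_t addr = var;		/* larl + lgr */
	uint64_t old, sum;

	lc->marker = 4;			/* mviy 960,4: a memory store */
	addr += lc->percpu_offset;	/* ag %r4,952 */

	/* laag: load the old value and add, one atomic instruction in hardware */
	memcpy(&old, mem + addr, sizeof(old));
	sum = old + val;
	memcpy(mem + addr, &sum, sizeof(sum));

	lc->marker = 0;			/* mviy 960,0: end of the percpu section */
	return old + val;		/* agr %r2,%r5 */
}
```

The point of the model is that both mviy instructions are plain stores to the prefix page, bracketing the address calculation and the atomic add.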

It sounds a little bit hacky to me TBH and incurs some extra overhead for
"migration detection" and fixup.

Sure, it is hacky, and the small overhead part is of course true.

Compared to the percpu page table proposal the two mviy instructions above
would go away, as well as the extra interrupt/exception overhead. Besides
that your proposal is way less hacky.

It would be great if we could compare the performance numbers for the two
approaches. rseq has been discussed for ARM64, but it seems too expensive
and just moves the overhead somewhere else.

I tried to implement the proposed rseq/kseq, but the required inline
assemblies resulted in code which was larger than what we have now for
s390.

Also with the current proposal I only did some quick micro benchmarks,
which resulted in 0-1% improvement, which is in the expected range.

It is amazing to see the performance improvements you see on arm64, however
I believe that is mainly because of the large amount of code which is
generated by the arm64 implementations of the preempt primitives
__preempt_count_add() and __preempt_count_dec_and_test().

Yes, we need 4 instructions on ARM64 to disable or enable preemption (one instruction loads the current pointer, the other 3 do the RMW on preempt_count). So I can remove 8 instructions in total for a single this_cpu op. That's a lot. Given that this_cpu ops are heavily used in the kernel, we end up running fewer instructions and getting a better icache hit rate; the better icache hit rate also helps reduce cross-node traffic on 2-socket systems.

That's a big difference to s390: for both primitives the result is a single
instruction.

Yeah, I see. s390 should see similar benefits in theory, but the gains may not be as significant.

Thanks,
Yang