Re: [PATCH v3 0/9] s390: Improve this_cpu operations

From: Yang Shi

Date: Thu May 21 2026 - 13:48:11 EST

On 5/21/26 3:37 AM, Heiko Carstens wrote:

On Wed, May 20, 2026 at 05:23:37PM -0700, Yang Shi wrote:

If I understand correctly, you replaced preempt_disable() and
preempt_enable() with seq begin and seq end, and seq begin and seq end
can be optimized by the mviy instruction on s390. So you just need a single
mviy instruction for each instead of a read-modify-write of the seq count.

But you need some extra overhead for context switch (save and restore
the seq count register) and need to check whether it is still on the
same cpu once resuming execution. And there is also a penalty if it is
migrated to another CPU (need to rerun the this_cpu ops).

Not as I understand it.
What happens is the context switch code 'corrupts' the register being
used to access per-cpu data so that it is correct for the new cpu.
The write of zero after the sequence is there to stop the register
being corrupted outside of this code window.

Thanks for elaborating it. I misunderstood some nuance. I read the patch #2
commit message, now I think I understand how it works.

As background: s390 has so-called prefix pages; the first two pages of every
CPU are percpu, via a special prefixing mechanism. Parts of the pages can be
used by operating systems as a percpu data area, which we use to have fast
access to e.g. the 'current' pointer, the pid, percpu_offset of the current
cpu, etc.

Helpful is also that for instructions which access memory with a base register
zero, its contents are assumed to be zero for address generation by the
hardware, regardless of its real contents. That is, the above

ag %r4,952

is the short version of

ag %r4,952(%r0)

The eight bytes at offset 952 of the current CPU's prefix page are added to
register 4. Real contents of register 0 are irrelevant for such address
generations; reducing register pressure.
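
To make the base register zero rule concrete, here is a minimal C sketch of
the address generation described above. It is only a toy model: the names
cpu_model and effective_address are made up for illustration, and index
registers and addressing modes are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model: just the 16 general purpose registers. */
struct cpu_model {
	uint64_t gpr[16];
};

/*
 * Address generation for a base + displacement operand.
 * A base register field of 0 means "no base register": it contributes
 * zero, regardless of what gpr[0] currently contains.
 */
static uint64_t effective_address(const struct cpu_model *cpu,
				  unsigned int base, int64_t disp)
{
	assert(base < 16);
	uint64_t b = (base == 0) ? 0 : cpu->gpr[base];

	return b + (uint64_t)disp;
}
```

With this model, "ag %r4,952" and "ag %r4,952(%r0)" both resolve to address
952, even if register 0 happens to hold an unrelated value.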

Aha, I see. So the prefix pages are some special memory?

Borrowing the disassembly from the patch #2 commit message:

11a8e6: c0 30 00 d0 c5 0d larl %r3,1b33300
11a8ec: b9 04 00 43 lgr %r4,%r3
11a8f0: eb 00 43 c0 00 52 mviy 960,4
11a8f6: e3 40 03 b8 00 08 ag %r4,952
11a8fc: eb 52 40 00 00 e8 laag %r5,%r2,0(%r4)
11a902: eb 00 03 c0 00 52 mviy 960,0
11a908: b9 08 00 25 agr %r2,%r5
11a90c: 07 fe br %r14

11a8f0 loads the percpu offset and marks the beginning of the percpu code
section. I believe this is needed with the percpu page table too because we
need to load the local percpu offset.

No, 11a8f0 _writes_ the base register number, which contains the percpu
address used by the percpu atomic op at 11a8fc, to offset 960 of the first
prefix page. It could also be written like

mviy 960(%r0),4

maybe that makes it more obvious what happens. And yes, this marks the
beginning of a percpu code section. The percpu offset is added to register r4
at 11a8f6 with the ag instruction. This could also be written like

ag %r4,952(%r0)

This reads the eight byte percpu_offset from offset 952 of the first prefix
page, and adds it to register r4.
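
As a rough illustration of the sequence above, here is a C model of what the
instructions from 11a8e6 to 11a908 do. This is only a sketch under
assumptions: struct model, the 8-byte-slot toy memory, and
model_this_cpu_add_return are hypothetical names, the offsets 952 and 960 are
taken from the disassembly, and the real laag is an atomic operation, which
this single-threaded model is not.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LC_PERCPU_OFFSET 952	/* eight byte percpu_offset (from the disassembly) */
#define LC_PERCPU_REG    960	/* one byte section marker (from the disassembly) */
#define MEM_SLOTS        64

struct model {
	uint8_t lowcore[4096];		/* first prefix page of the current CPU */
	uint64_t mem[MEM_SLOTS];	/* toy memory, addressed in 8-byte slots (assumption) */
};

static uint64_t model_this_cpu_add_return(struct model *m, uint64_t var_slot,
					  uint64_t val)
{
	uint64_t r4 = var_slot;			/* larl %r3,... ; lgr %r4,%r3 */
	uint64_t off, old;

	m->lowcore[LC_PERCPU_REG] = 4;		/* mviy 960,4: section begin, base is r4 */
	memcpy(&off, &m->lowcore[LC_PERCPU_OFFSET], sizeof(off));
	r4 += off;				/* ag %r4,952 */
	assert(r4 < MEM_SLOTS);
	old = m->mem[r4];			/* laag %r5,%r2,0(%r4): load old value ... */
	m->mem[r4] = old + val;			/* ... and add val in memory */
	m->lowcore[LC_PERCPU_REG] = 0;		/* mviy 960,0: section end */
	return val + old;			/* agr %r2,%r5 */
}
```

Both mviy instructions only store one byte into the prefix page; the percpu
offset itself is picked up by the ag instruction.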

Got it.

11a920 loads 0 to the register to mark the percpu code section end, this is
not needed with percpu page table.

I guess you meant 11a902. But yes, this marks the end of the percpu code
section. Just that this is not a register, but a memory location that is
written to.

So both mviy instructions actually do memory stores?

And you need to save the register at the irq/exception entry, then restore
it at exit. But you also need to check whether migration happened or not; if
it happened, the kernel needs to rewrite the register with the correct percpu
offset and needs to check whether the interrupted instruction is "ag".

Yes.

If it is the "ag" instruction (11a8f6), the kernel needs to recalculate the
percpu address, right?

No, if it is within the percpu code section and it is _not_ the ag instruction,
the percpu base register needs to be adjusted (that's by the way a bug in
patch two, which has this logic inverted - my mistake).
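
A minimal sketch of that rule in C, just to spell out the condition. The
names task_regs and fixup_percpu_base are hypothetical and not taken from the
patch. The sketch also assumes that "adjusting" means replacing the old CPU's
percpu offset with the new CPU's, and that "the interrupted instruction is the
ag" means the ag has not been executed yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical saved register state of the interrupted context. */
struct task_regs {
	uint64_t gpr[16];
};

/*
 * marker: the byte read from offset 960 of the prefix page; 0 means
 * "not inside a percpu code section", otherwise the base register number.
 */
static void fixup_percpu_base(struct task_regs *regs, uint8_t marker,
			      bool interrupted_insn_is_ag,
			      uint64_t old_offset, uint64_t new_offset)
{
	assert(marker < 16);
	if (marker == 0)
		return;		/* outside of a percpu code section */
	if (interrupted_insn_is_ag)
		return;		/* ag still pending: it will add the new CPU's offset */
	/* Assumption: swap the old CPU's offset for the new CPU's offset. */
	regs->gpr[marker] += new_offset - old_offset;
}
```

The "not the ag instruction" case is the one where the base register already
contains the old CPU's offset, which is why only that case needs the fixup.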

Yeah, I see.

It sounds a little bit hacky to me TBH and incurs some extra overhead for
"migration detection" and fixup.

Sure, it is hacky, and the small overhead part is of course true.

Compared to the percpu page table proposal the two mviy instructions above
would go away, as well as the extra interrupt/exception overhead. Besides
that your proposal is way less hacky.

It would be great if we could compare the performance numbers for the two
approaches. rseq has been discussed for ARM64, but it seems too expensive and
just moves the overhead somewhere else.

So it seems to have more overhead than the percpu page table approach IIUC.
We don't need all the steps with the percpu page table. And there is no
penalty for migration.

This code looks like it relies on 'page zero' already being percpu.
So it probably isn't really that different.
Some values like the 'preemption disable count' and 'current' could be
(maybe are?) written into page zero to give fast access.

I don't quite get what you mean about 'page zero'.

Hopefully the above description with prefix pages explains it?

Yes, definitely, thank you so much for elaborating it.

Regards,
Yang