Re: [RFC] in-kernel rseq

From: Peter Zijlstra

Date: Tue Feb 24 2026 - 10:22:49 EST

On Tue, Feb 24, 2026 at 12:16:46PM +0100, Heiko Carstens wrote:

> Let's assume s390 would be the target, which also uses atomics for
> this_cpu ops. A very simple function like:
>
> static DEFINE_PER_CPU(long, bar);
>
> long foo(long val)
> {
> return this_cpu_add_return(bar, val);
> }
>
> would turn into the below with PREEMPT_NONE:
>
> 0000000000000000 <foo>:
> 0: c0 04 00 00 00 00 jgnop 0 <foo>
> 6: c0 10 00 00 00 00 larl %r1,6 <foo+0x6> <- r1 contains address of "bar"
> 8: R_390_PC32DBL .data..percpu+0x2
> c: a7 39 00 00 lghi %r3,0
> 10: e3 10 33 b8 00 08 ag %r1,952(%r3) <- add per-cpu offset
> 16: eb 02 10 00 00 e8 laag %r0,%r2,0(%r1) <- atomic op
> 1c: b9 08 00 20 agr %r2,%r0
> 20: 07 fe br %r14
>
> With PREEMPT_LAZY this turns into:
>
> 0000000000000000 <foo>:
> 0: c0 04 00 00 00 00 jgnop 0 <foo>
> 6: eb af f0 68 00 24 stmg %r10,%r15,104(%r15)
> c: b9 04 00 ef lgr %r14,%r15
> 10: b9 04 00 b2 lgr %r11,%r2
> 14: e3 f0 ff c8 ff 71 lay %r15,-56(%r15)
> 1a: e3 e0 f0 98 00 24 stg %r14,152(%r15) <- up to here: create stack frame

So some of that could be elided with that asm call thunk thing we talked
about yesterday, right?

> 20: eb 01 03 a8 00 6a asi 936,1 <- preempt_inc()
> 26: c0 10 00 00 00 00 larl %r1,26 <foo+0x26>
> 28: R_390_PC32DBL .data..percpu+0x2
> 2c: a7 29 00 00 lghi %r2,0
> 30: e3 10 23 b8 00 08 ag %r1,952(%r2)
> 36: eb ab 10 00 00 e8 laag %r10,%r11,0(%r1)
> 3c: eb ff 03 a8 00 6e alsi 936,-1 <- preempt_dec_and_test()
> 42: a7 54 00 05 jnhe 4c <foo+0x4c>
> 46: c0 e5 00 00 00 00 brasl %r14,46 <foo+0x46>
> 48: R_390_PLT32DBL preempt_schedule_notrace+0x2
> 4c: b9 e8 b0 2a agrk %r2,%r10,%r11
> 50: eb af f0 a0 00 04 lmg %r10,%r15,160(%r15)
> 56: 07 fe br %r14
>
> With your proposal I guess this would turn into something like the below. Note,
> the below is hand-edited, so the offsets etc. do not make any sense; it is
> just the instruction sequence I guess we _could_ end up with:
>
> 0000000000000000 <foo>:
> 0: c0 04 00 00 00 00 jgnop 0 <foo>
> larl %r1,#this_seq <- &_RR
> stg %r1,944 <- lowcore->sched_seq = &_R;
> c: c0 10 00 00 00 00 larl %r1,c <foo+0xc>
> e: R_390_PC32DBL .data..percpu+0x2
> 16: e3 10 33 b8 00 08 ag %r1,952
> 1c: eb 02 10 00 00 e8 laag %r0,%r2,0(%r1)
> mvghi 944,0 <- lowcore->sched_seq = NULL;
> 2c: b9 08 00 20 agr %r2,%r0
> 30: 07 fe br %r14
>
> This uses the s390-specific "lowcore" instead of current for sched_seq, since
> it is an architecture per-cpu area mapped at address zero.

Right, something like that. This is hopefully 'better' :-)
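
Roughly, in C terms, the hand-edited sequence quoted above brackets the
atomic op with a store to, and a clear of, sched_seq. The following is only
a user-space sketch of that shape, not kernel code: struct fake_lowcore, the
single global "CPU", and the contents of this_seq are all made up for
illustration, and what the scheduler does when it finds sched_seq set is not
spelled out in the quoted text, so it is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the s390 lowcore (one per-CPU area). */
struct fake_lowcore {
	/* Models the slot at offset 944 in the listing. volatile keeps the
	 * compiler from dropping or merging the two stores in foo(). */
	const void *volatile sched_seq;
};

static struct fake_lowcore lowcore;	/* a single "CPU", for illustration */
static _Atomic long bar;		/* models DEFINE_PER_CPU(long, bar) */

/* Hypothetical sequence descriptor; its real contents are not defined here. */
static const char this_seq = 0;

long foo(long val)
{
	long old;

	assert(lowcore.sched_seq == NULL);	/* this sketch does not nest */

	lowcore.sched_seq = &this_seq;		/* larl + stg  %r1,944 */
	old = atomic_fetch_add_explicit(&bar, val,
					memory_order_relaxed);	/* laag */
	lowcore.sched_seq = NULL;		/* mvghi 944,0 */

	return old + val;			/* agr  %r2,%r0 */
}
```

Calling foo(5) and then foo(-2) returns 5 and 3, and sched_seq is NULL again
after each call. Compared to the PREEMPT_LAZY variant there is no
preempt_inc()/preempt_dec_and_test() pair and no conditional call on the way
out, which is where the shorter sequence comes from.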