Re: [RFC] in-kernel rseq

From: Heiko Carstens

Date: Tue Feb 24 2026 - 06:19:24 EST

On Mon, Feb 23, 2026 at 05:38:43PM +0100, Peter Zijlstra wrote:
> This means, it needs to be woven into the asm... and I'm not that handy
> with arm64 asm.
>
> The pseudo code would be something like:
>
> current->sched_seq = &_R;
> ...
>
> _start: compute per cpu-addr
> load addr
> $OP
> _commit: store addr
>
> ...
> current->sched_rseq = NULL;
>
>
> Then when preemption happens (from interrupt), the instruction pointer
> is 'simply' reset to _start and it tries again.

I guess current->sched_rseq would also need to be saved on entry to, and
restored on exit from, every interrupt, exception, and NMI, since those
contexts can make use of this_cpu ops as well.
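
To illustrate the save/restore point, here is a minimal userspace sketch.
Everything in it is hypothetical: the struct rseq_region layout, the
ctx_enter()/ctx_exit() helpers, and the plain global standing in for
current->sched_rseq are assumptions made for illustration, not existing
kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor of one restartable region (the "_R" above). */
struct rseq_region {
	unsigned long start;	/* address of _start */
	unsigned long commit;	/* address of _commit */
};

/* Stand-in for current->sched_rseq; a plain global in this sketch. */
static const struct rseq_region *sched_rseq;

/* Entry: remember the interrupted context's region, then clear it. */
static const struct rseq_region *ctx_enter(void)
{
	const struct rseq_region *saved = sched_rseq;

	sched_rseq = NULL;
	return saved;
}

/* Exit: put back whatever the interrupted context had set up. */
static void ctx_exit(const struct rseq_region *saved)
{
	/* The handler must have finished its own sequence by now. */
	assert(sched_rseq == NULL);
	sched_rseq = saved;
}

/* Simulated handler that runs its own this_cpu op inside a region. */
static void irq_handler(const struct rseq_region *irq_region)
{
	const struct rseq_region *saved = ctx_enter();

	sched_rseq = irq_region;	/* handler enters its sequence */
	/* ... load, $OP, store would go here ... */
	sched_rseq = NULL;		/* handler leaves its sequence */

	ctx_exit(saved);
}

/* Returns 1 if the interrupted task's pointer survives the handler. */
int demo_nested(void)
{
	static const struct rseq_region task_r = { 0x100, 0x110 };
	static const struct rseq_region irq_r = { 0x200, 0x210 };

	sched_rseq = &task_r;		/* task is inside its sequence */
	irq_handler(&irq_r);		/* interrupt arrives mid-sequence */
	return sched_rseq == &task_r;
}
```

Without the save in ctx_enter() and the restore in ctx_exit(), the
handler's final NULL store would wipe out the interrupted task's pointer,
and a later preemption of the task would no longer restart its sequence.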

> Anyway, this was aimed at arm64, which chose to use atomics for
> this_cpu. But if we move sched_rseq() from schedule-tail into interrupt
> entry, then this would also work for things like Power.

Let's assume s390 would be the target, which also uses atomics for
this_cpu ops. A very simple function like:

static DEFINE_PER_CPU(long, bar);

long foo(long val)
{
	return this_cpu_add_return(bar, val);
}

would turn into the below with PREEMPT_NONE:

0000000000000000 <foo>:
0: c0 04 00 00 00 00 jgnop 0 <foo>
6: c0 10 00 00 00 00 larl %r1,6 <foo+0x6> <- r1 contains address of "bar"
8: R_390_PC32DBL .data..percpu+0x2
c: a7 39 00 00 lghi %r3,0
10: e3 10 33 b8 00 08 ag %r1,952(%r3) <- add per-cpu offset
16: eb 02 10 00 00 e8 laag %r0,%r2,0(%r1) <- atomic op
1c: b9 08 00 20 agr %r2,%r0
20: 07 fe br %r14

With PREEMPT_LAZY this turns into:

0000000000000000 <foo>:
0: c0 04 00 00 00 00 jgnop 0 <foo>
6: eb af f0 68 00 24 stmg %r10,%r15,104(%r15)
c: b9 04 00 ef lgr %r14,%r15
10: b9 04 00 b2 lgr %r11,%r2
14: e3 f0 ff c8 ff 71 lay %r15,-56(%r15)
1a: e3 e0 f0 98 00 24 stg %r14,152(%r15) <- up to here: create stack frame
20: eb 01 03 a8 00 6a asi 936,1 <- preempt_inc()
26: c0 10 00 00 00 00 larl %r1,26 <foo+0x26>
28: R_390_PC32DBL .data..percpu+0x2
2c: a7 29 00 00 lghi %r2,0
30: e3 10 23 b8 00 08 ag %r1,952(%r2)
36: eb ab 10 00 00 e8 laag %r10,%r11,0(%r1)
3c: eb ff 03 a8 00 6e alsi 936,-1 <- preempt_dec_and_test()
42: a7 54 00 05 jnhe 4c <foo+0x4c>
46: c0 e5 00 00 00 00 brasl %r14,46 <foo+0x46>
48: R_390_PLT32DBL preempt_schedule_notrace+0x2
4c: b9 e8 b0 2a agrk %r2,%r10,%r11
50: eb af f0 a0 00 04 lmg %r10,%r15,160(%r15)
56: 07 fe br %r14

With your proposal I guess this would turn into something like the below.
Note that the below is hand-edited, so the offsets etc. do not make any
sense; it is just the instruction sequence I guess we _could_ end up with:

0000000000000000 <foo>:
0: c0 04 00 00 00 00 jgnop 0 <foo>
larl %r1,#this_seq <- &_R
stg %r1,944 <- lowcore->sched_seq = &_R;
c: c0 10 00 00 00 00 larl %r1,c <foo+0xc>
e: R_390_PC32DBL .data..percpu+0x2
16: e3 10 33 b8 00 08 ag %r1,952
1c: eb 02 10 00 00 e8 laag %r0,%r2,0(%r1)
mvghi 944,0 <- lowcore->sched_seq = NULL;
2c: b9 08 00 20 agr %r2,%r0
30: 07 fe br %r14

This uses the s390 specific "lowcore" instead of current for sched_seq, since
it is an architecture per-cpu area mapped at address zero.
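
For reference, the fixup that interrupt entry would have to perform for
such a sequence could be sketched in C roughly as below. This is only a
sketch under assumptions: struct rseq_region, lc_sched_seq, and
rseq_fixup_ip() are hypothetical names, and the region layout (a start and
a commit address) is my reading of the pseudo code, not something that
exists today.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor that the sched_seq slot points to ("_R"). */
struct rseq_region {
	unsigned long start;	/* address of _start */
	unsigned long commit;	/* address of the committing store */
};

/* Stand-in for the lowcore slot at offset 944 in the listing above. */
static const struct rseq_region *lc_sched_seq;

/*
 * Return the address at which the interrupted context should resume.
 *
 * ip is the address of the next instruction to execute. If it equals
 * r->commit the store has not happened yet, so the sequence still has
 * to be restarted; hence the inclusive upper bound.
 */
unsigned long rseq_fixup_ip(unsigned long ip)
{
	const struct rseq_region *r = lc_sched_seq;

	if (!r)
		return ip;		/* not inside any sequence */

	assert(r->start <= r->commit);
	if (ip >= r->start && ip <= r->commit)
		return r->start;	/* rewind and try again */
	return ip;			/* already past the commit */
}
```

With a region covering 0xc to 0x1c, an interrupt taken at 0x16 would resume
at 0xc, while one taken at 0x2c would resume where it was.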

Let me give it a try to verify whether the generated code would really
look like the above, but that might take a few days.