Re: [RFC] in-kernel rseq

From: Mathieu Desnoyers

Date: Mon Feb 23 2026 - 13:24:01 EST


On 2026-02-23 12:53, David Laight wrote:
On Mon, 23 Feb 2026 17:38:43 +0100
Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

Hi,

It has come to my attention that various people are struggling with
preempt_disable()+preempt_enable() costs for various architectures.

Mostly in relation to things like this_cpu_ and/or local_.

The below is a very crude (and broken, more on that below) POC.

So the 'main' advantage of this over preempt_disable()/preempt_enable()
is on the preempt_enable() side: this elides the whole conditional and
call schedule() nonsense.

Now, on to the broken part: the 'commit' address below should be the
address of the 'STORE' instruction. In case of LL/SC, it should be the
SC; in case of LSE, it should be the LSE instruction.

I think it would be better as the address of the instruction after
the 'store'.

That's indeed what we do for userspace rseq.

You probably don't need separate 'begin' and 'restart' addresses.

It's not needed as long as the abort behavior is only restart. It
becomes useful if another behavior is wanted on abort. But since
this is kernel code and not ABI, it can change if the need arises.

It might be enough to save the 'restart' address and a byte length
directly in 'current', much simpler code.

That would make it two stores to the task struct. Those would not be
single-instruction, so we'd have to deal with preemption coming between
those two stores. Also this would be more code: two stores compared
to a single pointer store to the task struct to begin the critical
section. AFAIU Peter's proposed approach is more efficient.
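
To make the single-store point concrete, here is a minimal userspace
sketch. All names (krseq_cs, task_sketch, krseq_enter) are hypothetical
and not taken from Peter's patch; it only assumes that each critical
section has a static descriptor and that the task struct holds one
pointer to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: one static instance per critical section. */
struct krseq_cs {
	uintptr_t begin;	/* first instruction of the section */
	uintptr_t commit;	/* committing store (assumed field) */
	uintptr_t restart;	/* where to resume after preemption */
};

/* Stand-in for the relevant part of the task struct (assumption). */
struct task_sketch {
	/* NULL while the task is outside any critical section. */
	const struct krseq_cs *volatile krseq;
};

/*
 * Entering the section is a single pointer-sized store, so preemption
 * sees either no section or a fully described one. With separate
 * 'restart' and 'length' fields there would be a window between the
 * two stores where only one of them is valid.
 */
static inline void krseq_enter(struct task_sketch *t,
			       const struct krseq_cs *cs)
{
	assert(cs != NULL);
	t->krseq = cs;
}
```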

We could turn the end address into a length if we want, this would
make it more like its userspace rseq ABI counterpart.
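
For illustration, the two encodings side by side. The struct and
function names are made up for this sketch; the length form mirrors the
single unsigned comparison used for userspace rseq
(ip - start_ip < post_commit_offset).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layouts: section covers [start, end) or [start, start + len). */
struct cs_end { uintptr_t start, end; };
struct cs_len { uintptr_t start, len; };

/* End-address form: two comparisons. */
static inline bool in_cs_end(const struct cs_end *cs, uintptr_t ip)
{
	assert(cs->end >= cs->start);
	return ip >= cs->start && ip < cs->end;
}

/*
 * Length form: one comparison. If ip < start, the unsigned subtraction
 * wraps to a huge value, which is never below len.
 */
static inline bool in_cs_len(const struct cs_len *cs, uintptr_t ip)
{
	return ip - cs->start < cs->len;
}
```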


How much it helps is another matter.
I'm sure I remember something about per-cpu data being used for something
because it was faster than using 'current' - not sure of the context.

The problem with per-cpu data for this is how to handle migration ?
The whole point of this is to replace preempt disable.


The real problem with rseq is they don't scale.

Not sure what you mean. They don't scale with respect to what ?

At least this is against the context switch code - which is a slow path.

This adds a task struct field load + NULL check on the scheduler
fast path. Is that what you are concerned about?
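
Roughly, the scheduler-side cost looks like the sketch below. This is a
userspace model with hypothetical names, not the POC code; it assumes
the 'post_commit' convention (address just after the store) discussed
above, and models the saved instruction pointer as a plain field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor; post_commit is the address after the store. */
struct krseq_cs {
	uintptr_t begin;
	uintptr_t post_commit;
	uintptr_t restart;
};

/* Stand-in for the preempted task: section pointer + saved IP. */
struct task_sketch {
	const struct krseq_cs *krseq;
	uintptr_t ip;
};

/* Called for the task being switched out (assumed hook point). */
static inline void krseq_preempt_fixup(struct task_sketch *prev)
{
	const struct krseq_cs *cs;

	assert(prev != NULL);
	cs = prev->krseq;		/* one load ... */
	if (!cs)			/* ... and one NULL check: common case */
		return;

	/* Preempted before the store completed: redo the sequence. */
	if (prev->ip >= cs->begin && prev->ip < cs->post_commit)
		prev->ip = cs->restart;
}
```

Everything past the NULL check only runs for tasks that were actually
inside a critical section when they got preempted.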

[...]
I think that is just unlocked RMW of a per-cpu/thread variable.
It is quite similar to LL/SC, but mitigated by the scheduler rather than
hardware, so it can use a sequence of cheaper load/store instructions on
the fast path. Also, based on prior benchmarks, a short sequence of
loads/stores was faster than an unlocked RMW instruction (at least on
x86-64).
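
As a toy model of the LL/SC analogy (plain C, single-threaded, names
invented for this sketch): the "preemption" is simulated in-line by
letting another task bump the counter between our load and our store,
after which the sequence restarts from the load instead of committing a
stale value. In the real thing the scheduler would do the restart by
rewriting the saved instruction pointer.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Increment *counter with a plain load/add/store sequence.
 * preempt_after_load: how many times to simulate being preempted
 * between the load and the committing store.
 * Returns the number of restarts taken.
 */
static int percpu_inc_restartable(long *counter, int preempt_after_load)
{
	long val;
	int restarts = 0;

	assert(counter != NULL);
restart:
	val = *counter;			/* load */
	if (preempt_after_load-- > 0) {
		/* Simulated: another task ran here and updated the counter. */
		(*counter)++;
		restarts++;
		goto restart;		/* stale 'val' is discarded */
	}
	*counter = val + 1;		/* commit store */
	return restarts;
}
```

No update is lost: with two simulated preemptions the counter ends up
at three (two from the other task, one from us).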

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com