Re: [patch 09/10] rseq: Reenable performance optimizations conditionally
From: Dmitry Vyukov
Date: Wed Apr 29 2026 - 05:42:53 EST
On Wed, 29 Apr 2026 at 01:34, Thomas Gleixner <tglx@xxxxxxxxxx> wrote:
>
> Due to the incompatibility with TCMalloc the RSEQ optimizations and
> extended features (time slice extensions) have been disabled and made
> run-time conditional.
>
> The original RSEQ implementation, which TCMalloc depends on, registers a 32
> byte region (ORIG_RSEQ_SIZE). This region has a 32 byte alignment
> requirement.
>
> The newer, extension-safe variant exposes the kernel RSEQ feature size via
> getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment requirement via
> getauxval(AT_RSEQ_ALIGN). The alignment requirement is that the registered
> RSEQ region is aligned to the next power of two of the feature size. The
> kernel currently has a feature size of 33 bytes, which means the alignment
> requirement is 64 bytes.
>
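For readers following the arithmetic, a minimal sketch of the "next power of two" rule described above. rseq_align_for_size() is a hypothetical helper name, not a kernel or libc function; on a running system the authoritative value is whatever getauxval(AT_RSEQ_ALIGN) reports.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel or libc API): smallest power of two
 * that is greater than or equal to size. Assumes size > 0.
 */
static size_t rseq_align_for_size(size_t size)
{
	size_t align = 1;

	while (align < size)
		align <<= 1;
	return align;
}
```

With the feature size of 33 bytes quoted above this yields 64, and with the original 32 byte size it yields 32.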
> The TCMalloc RSEQ region is embedded into a cache line aligned data
> structure starting at offset 32 bytes so that bytes 28-31 and the
> cpu_id_start field at bytes 32-35 form a 64-bit little endian pointer with
> the top-most bit (bit 63) set to check whether the kernel has overwritten
> cpu_id_start with an actual CPU id value, which is guaranteed to not have
> the top most bit set.
>
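A small sketch of the overlay trick as the paragraph above describes it, assuming a 64 byte cache line with the RSEQ region at offset 32. All helper names are hypothetical and this is not TCMalloc's actual code. The bytes are stored one by one so the sketch does not depend on host endianness.

```c
#include <stdbool.h>
#include <stdint.h>

/* Store a 64-bit value in little endian byte order. */
static void store_le64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/* Store a 32-bit value in little endian byte order. */
static void store_le32(uint8_t *p, uint32_t v)
{
	for (int i = 0; i < 4; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/* Load a 64-bit little endian value. */
static uint64_t load_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v |= (uint64_t)p[i] << (8 * i);
	return v;
}

/*
 * Hypothetical check: bytes 28-35 of the cache line form one 64-bit
 * value. Bit 63 lives in cpu_id_start (bytes 32-35), so it is cleared
 * once the kernel stores a real CPU id there.
 */
static bool kernel_wrote_cpu_id(const uint8_t *line)
{
	return !(load_le64(line + 28) & (1ULL << 63));
}
```

User space would store a value with bit 63 set at offset 28. A later kernel write of a CPU id to bytes 32-35 replaces the upper half, which is what the check detects.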
> As this is part of their performance-tuned magic, it's a pretty safe
> assumption that TCMalloc won't use a larger RSEQ size.
>
> This allows the kernel to declare registrations with a size greater
> than the original size of 32 bytes, which is the case since time slice
> extensions got introduced, as RSEQ ABI v2 with the following differences to
> the original behaviour:
>
> 1) Unconditional updates of the user read only fields (CPU, node, MMCID)
> are removed. Those fields are only updated on registration, task
> migration and MMCID changes.
>
> 2) Unconditional evaluation of the critical section pointer is
> removed. It's only evaluated when user space was interrupted and was
> scheduled out or before delivering a signal in the interrupted
> context.
>
> 3) The read-only requirement of the ID fields is enforced. When the
> kernel detects that userspace manipulated the fields, the process is
> terminated. This ensures that multiple entities (libraries) can
> utilize RSEQ without interfering.
>
> 4) Today's extended RSEQ feature (time slice extensions) and future
> extensions are only enabled in the v2 enabled mode.
>
> Registrations with the original size of 32 bytes operate in backwards
> compatible legacy mode without performance improvements and extended
> features.
>
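The mode selection described above can be modelled in a few lines of user-space C. This is only a sketch: rseq_abi_version() and rseq_extended_features() are hypothetical names, and the generic_entry parameter stands in for the CONFIG_GENERIC_IRQ_ENTRY check in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define ORIG_RSEQ_SIZE 32u

/*
 * Hypothetical user-space model of the version selection: a registration
 * larger than the original 32 bytes selects v2, provided the architecture
 * uses the generic entry code (modelled by generic_entry).
 */
static unsigned int rseq_abi_version(uint32_t rseq_len, bool generic_entry)
{
	return (generic_entry && rseq_len > ORIG_RSEQ_SIZE) ? 2 : 1;
}

/* Extended features such as time slice extensions require v2. */
static bool rseq_extended_features(unsigned int version)
{
	return version > 1;
}
```

A 32 byte registration always stays in legacy mode in this model, which matches the backwards compatibility statement above.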
> Unfortunately that also affects users of older GLIBC versions which
> register the original size of 32 bytes and do not evaluate the kernel
> required size in the auxiliary vector AT_RSEQ_FEATURE_SIZE.
>
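To illustrate what "evaluating the kernel required size" could look like on the user-space side, here is a hedged sketch. rseq_pick_len() is a hypothetical helper, not glibc's implementation. Its argument is assumed to be the value returned by getauxval(AT_RSEQ_FEATURE_SIZE), which is 0 when the kernel does not provide that entry.

```c
#include <stdint.h>

#define ORIG_RSEQ_SIZE 32u

/*
 * Hypothetical helper: pick the length to pass to the rseq syscall.
 * feature_size is assumed to come from getauxval(AT_RSEQ_FEATURE_SIZE).
 * Anything not larger than the original size falls back to the original
 * 32 byte registration.
 */
static uint32_t rseq_pick_len(unsigned long feature_size)
{
	if (feature_size <= ORIG_RSEQ_SIZE)
		return ORIG_RSEQ_SIZE;
	return (uint32_t)feature_size;
}
```

The region itself would additionally have to satisfy the alignment reported by getauxval(AT_RSEQ_ALIGN); the sketch covers only the length.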
> That's the result of the lack of enforcement in the original implementation
> and the unwillingness of a single entity to cooperate with the larger
> ecosystem for many years.
>
> Implement the required registration changes by restructuring the spaghetti
> code and adding the size/version check. Also add documentation about the
> differences of legacy and optimized RSEQ V2 mode.
>
> Thanks to Mathieu for pointing out the ORIG_RSEQ_SIZE constraints!
>
> Fixes: d6200245c75e ("rseq: Allow registering RSEQ with slice extension")
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
Reviewed-by: Dmitry Vyukov <dvyukov@xxxxxxxxxx>
> ---
> Documentation/userspace-api/rseq.rst | 94 ++++++++++++++++++++++
> kernel/rseq.c | 144 ++++++++++++++++++++---------------
> 2 files changed, 178 insertions(+), 60 deletions(-)
>
> --- a/Documentation/userspace-api/rseq.rst
> +++ b/Documentation/userspace-api/rseq.rst
> @@ -24,6 +24,97 @@ Quick access to CPU number, node ID
> Allows to implement per CPU data efficiently. Documentation is in code and
> selftests. :(
>
> +Optimized RSEQ V2
> +-----------------
> +
> +On architectures which utilize the generic entry code and generic TIF bits
> +the kernel supports runtime optimizations for RSEQ, which also enable
> +enhanced features like scheduler time slice extensions.
> +
> +To enable them a task has to register the RSEQ region with at least the
> +length advertised by getauxval(AT_RSEQ_FEATURE_SIZE).
> +
> +If existing binaries register with ORIG_RSEQ_SIZE (32 bytes), the kernel
> +keeps the legacy low performance mode enabled to fulfil the expectations
> +of existing users regarding the original RSEQ implementation behaviour.
> +
> +The following table documents the ABI and behavioral guarantees of the
> +legacy and the optimized V2 mode.
> +
> +.. list-table:: RSEQ modes
> + :header-rows: 1
> +
> + * - Nr
> + - What
> +
> + - Legacy
> + - Optimized V2
> +
> + * - 1
> + - The cpu_id_start, cpu_id, node_id and mm_cid fields (User mode read
> + only)
> + .. Legacy
> + - Updated by the kernel unconditionally after each context switch and
> + before signal delivery
> + .. Optimized V2
> + - Updated by the kernel if and only if they change, i.e. if the task
> + is migrated or mm_cid changes
> +
> + * - 2
> + - The rseq_cs critical section field
> + .. Legacy
> + - Evaluated and handled unconditionally after each context switch and
> + before signal delivery
> + .. Optimized V2
> + - Evaluated and handled conditionally only when user space was
> + interrupted and was scheduled out or before delivering a signal in
> + the interrupted context.
> +
> + * - 3
> + - Read only fields
> + .. Legacy
> + - No strict enforcement except in debug mode
> + .. Optimized V2
> + - Strict enforcement
> +
> + * - 4
> + - membarrier(...RSEQ)
> + .. Legacy
> + - All running threads of the process are interrupted, the ID fields
> + are rewritten and any active critical sections are aborted
> + before they return to user space. All threads which are scheduled
> + out, whether voluntarily or not, are covered by #1/#2 above.
> + .. Optimized V2
> + - All running threads of the process are interrupted and any
> + active critical sections are aborted before these threads return to
> + user space. The ID fields are only updated if changed as a
> + consequence of the interrupt. All threads which are scheduled out,
> + whether voluntarily or not, are covered by #1/#2 above.
> +
> + * - 5
> + - Time slice extensions
> + .. Legacy
> + - Not supported
> + .. Optimized V2
> + - Supported
> +
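Row 1 of the table is the core of the performance difference. A toy model, with hypothetical function names and a deliberately simplified view (only the CPU number is tracked, mm_cid changes and signal delivery are ignored), counts how often the ID fields would be written for a given sequence of context switches:

```c
#include <stddef.h>

/*
 * Toy model, not kernel code. cpus[i] is the CPU the task runs on after
 * the i-th context switch. Legacy mode rewrites the ID fields every time.
 */
static size_t legacy_id_writes(const int *cpus, size_t n)
{
	(void)cpus;
	return n;
}

/*
 * Optimized V2 in this model: one write for the first entry (standing in
 * for registration), then only when the CPU differs from the previous one.
 */
static size_t v2_id_writes(const int *cpus, size_t n)
{
	size_t writes = 0;

	for (size_t i = 0; i < n; i++) {
		if (i == 0 || cpus[i] != cpus[i - 1])
			writes++;
	}
	return writes;
}
```

For the sequence 0, 0, 0, 1, 1, 0 the legacy model performs six writes and the V2 model three.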
> +The legacy mode is obviously less performant as it does unconditional
> +updates and critical section checks even if not strictly required by the
> +ABI contract. That can't be changed anymore as some users depend on that
> +observed behavior, which in turn enables them to violate the ABI and
> +overwrite the cpu_id_start field for their own purposes. This is obviously
> +discouraged as it renders RSEQ incompatible with the intended usage and
> +breaks the expectation of other libraries in the same application.
> +
> +The ABI compliant optimized v2 mode, which respects the read only fields,
> +does not require unconditional updates and therefore is way more
> +performant. The kernel validates the read only fields for compliance. If
> +user space modifies them, the process is killed. Compliant usage allows
> +multiple libraries in the same application to benefit from the RSEQ
> +functionality without disturbing each other. The ABI compliant optimized v2
> +mode also enables extended RSEQ features like time slice extensions.
> +
> +
> Scheduler time slice extensions
> -------------------------------
>
> @@ -37,7 +128,8 @@ scheduled out inside of the critical sec
>
> * Enabled at boot time (default is enabled)
>
> - * A rseq userspace pointer has been registered for the thread
> + * A rseq userspace pointer has been registered for the thread in
> + optimized V2 mode
>
> The thread has to enable the functionality via prctl(2)::
>
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -413,70 +413,23 @@ static bool rseq_reset_ids(void)
> /* The original rseq structure size (including padding) is 32 bytes. */
> #define ORIG_RSEQ_SIZE 32
>
> -/*
> - * sys_rseq - setup restartable sequences for caller thread.
> - */
> -SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
> +static long rseq_register(struct rseq __user * rseq, u32 rseq_len, int flags, u32 sig)
> {
> u32 rseqfl = 0;
> u8 version = 1;
>
> - if (flags & RSEQ_FLAG_UNREGISTER) {
> - if (flags & ~RSEQ_FLAG_UNREGISTER)
> - return -EINVAL;
> - /* Unregister rseq for current thread. */
> - if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
> - return -EINVAL;
> - if (rseq_len != current->rseq.len)
> - return -EINVAL;
> - if (current->rseq.sig != sig)
> - return -EPERM;
> - if (!rseq_reset_ids())
> - return -EFAULT;
> - rseq_reset(current);
> - return 0;
> - }
> -
> - if (unlikely(flags & ~(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)))
> - return -EINVAL;
> -
> - if (current->rseq.usrptr) {
> - /*
> - * If rseq is already registered, check whether
> - * the provided address differs from the prior
> - * one.
> - */
> - if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
> - return -EINVAL;
> - if (current->rseq.sig != sig)
> - return -EPERM;
> - /* Already registered. */
> - return -EBUSY;
> - }
> -
> - /*
> - * If there was no rseq previously registered, ensure the provided rseq
> - * is properly aligned, as communcated to user-space through the ELF
> - * auxiliary vector AT_RSEQ_ALIGN. If rseq_len is the original rseq
> - * size, the required alignment is the original struct rseq alignment.
> - *
> - * The rseq_len is required to be greater or equal to the original rseq
> - * size. In order to be valid, rseq_len is either the original rseq size,
> - * or large enough to contain all supported fields, as communicated to
> - * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE.
> - */
> - if (rseq_len < ORIG_RSEQ_SIZE ||
> - (rseq_len == ORIG_RSEQ_SIZE && !IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE)) ||
> - (rseq_len != ORIG_RSEQ_SIZE && (!IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) ||
> - rseq_len < offsetof(struct rseq, end))))
> - return -EINVAL;
> if (!access_ok(rseq, rseq_len))
> return -EFAULT;
>
> /*
> - * The version check effectivly disables time slice extensions until the
> - * RSEQ ABI V2 registration are implemented.
> + * On architectures which use the generic IRQ entry code, registrations
> + * with a size greater than the original fixed v1 size utilize the
> + * optimized v2 ABI mode, which also enables extended RSEQ features
> + * beyond MMCID. @rseq_len has been validated by the caller already.
> */
> + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_len > ORIG_RSEQ_SIZE)
> + version = 2;
> +
> if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION) && version > 1) {
> if (rseq_slice_extension_enabled()) {
> rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
> @@ -524,11 +477,10 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
> #endif
>
> /*
> - * If rseq was previously inactive, and has just been
> - * registered, ensure the cpu_id_start and cpu_id fields
> - * are updated before returning to user-space.
> + * Ensure the cpu_id_start and cpu_id fields are updated before
> + * returning to user-space.
> */
> - current->rseq.event.has_rseq = true;
> + current->rseq.event.has_rseq = version;
> rseq_force_update();
> return 0;
>
> @@ -536,6 +488,80 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
> return -EFAULT;
> }
>
> +static long rseq_unregister(struct rseq __user * rseq, u32 rseq_len, int flags, u32 sig)
> +{
> +	if (flags & ~RSEQ_FLAG_UNREGISTER)
> +		return -EINVAL;
> +	if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
> +		return -EINVAL;
> +	if (rseq_len != current->rseq.len)
> +		return -EINVAL;
> +	if (current->rseq.sig != sig)
> +		return -EPERM;
> +	if (!rseq_reset_ids())
> +		return -EFAULT;
> +	rseq_reset(current);
> +	return 0;
> +}
> +
> +static long rseq_reregister(struct rseq __user * rseq, u32 rseq_len, u32 sig)
> +{
> +	/*
> +	 * If rseq is already registered, check whether the provided address
> +	 * differs from the prior one.
> +	 */
> +	if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
> +		return -EINVAL;
> +	if (current->rseq.sig != sig)
> +		return -EPERM;
> +	/* Already registered. */
> +	return -EBUSY;
> +}
> +
> +static bool rseq_length_valid(struct rseq __user *rseq, unsigned int rseq_len)
> +{
> +	/*
> +	 * Ensure the provided rseq is properly aligned, as communicated to
> +	 * user-space through the ELF auxiliary vector AT_RSEQ_ALIGN. If
> +	 * rseq_len is the original rseq size, the required alignment is the
> +	 * original struct rseq alignment.
> +	 *
> +	 * In order to be valid, rseq_len is either the original rseq size, or
> +	 * large enough to contain all supported fields, as communicated to
> +	 * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE.
> +	 */
> +	if (rseq_len < ORIG_RSEQ_SIZE)
> +		return false;
> +
> +	if (rseq_len == ORIG_RSEQ_SIZE)
> +		return IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE);
> +
> +	return IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) &&
> +	       rseq_len >= offsetof(struct rseq, end);
> +}
> +
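For anyone who wants to poke at the three cases of rseq_length_valid() outside the kernel, a user-space model could look like this. The function name and the two extra parameters are assumptions for illustration: alloc_align stands in for rseq_alloc_align() and must be a power of two, feature_end stands in for offsetof(struct rseq, end).

```c
#include <stdbool.h>
#include <stdint.h>

#define ORIG_RSEQ_SIZE 32u

/*
 * Hypothetical user-space model of the length and alignment validation.
 * alloc_align models rseq_alloc_align() and is assumed to be a power of
 * two. feature_end models offsetof(struct rseq, end).
 */
static bool rseq_length_valid_model(uintptr_t addr, uint32_t len,
				    uintptr_t alloc_align, uint32_t feature_end)
{
	/* Shorter than the original ABI size: never valid. */
	if (len < ORIG_RSEQ_SIZE)
		return false;

	/* Original size: original 32 byte alignment applies. */
	if (len == ORIG_RSEQ_SIZE)
		return (addr & (ORIG_RSEQ_SIZE - 1)) == 0;

	/* Larger size: advertised alignment and full feature coverage. */
	return (addr & (alloc_align - 1)) == 0 && len >= feature_end;
}
```

With alloc_align 64 and feature_end 33, taken from the numbers in the changelog, an address of 0x1020 is acceptable for a 32 byte registration but not for a 64 byte one.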
> +#define RSEQ_FLAGS_SUPPORTED	(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)
> +
> +/*
> + * sys_rseq - Register or unregister restartable sequences for the caller thread.
> + */
> +SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
> +{
> +	if (flags & RSEQ_FLAG_UNREGISTER)
> +		return rseq_unregister(rseq, rseq_len, flags, sig);
> +
> +	if (unlikely(flags & ~RSEQ_FLAGS_SUPPORTED))
> +		return -EINVAL;
> +
> +	if (current->rseq.usrptr)
> +		return rseq_reregister(rseq, rseq_len, sig);
> +
> +	if (!rseq_length_valid(rseq, rseq_len))
> +		return -EINVAL;
> +
> +	return rseq_register(rseq, rseq_len, flags, sig);
> +}
> +
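The order of the checks in the restructured syscall can be captured in a small decision function. This is a sketch with hypothetical names. FLAG_UNREGISTER mirrors RSEQ_FLAG_UNREGISTER, while the value of FLAGS_SUPPORTED is a placeholder and not the real RSEQ_FLAG_SLICE_EXT_DEFAULT_ON value.

```c
#include <stdbool.h>

#define FLAG_UNREGISTER	(1 << 0)	/* mirrors RSEQ_FLAG_UNREGISTER */
#define FLAGS_SUPPORTED	(1 << 1)	/* placeholder value, an assumption */

enum rseq_path {
	PATH_UNREGISTER,
	PATH_REREGISTER,
	PATH_REGISTER,
	PATH_EINVAL,
};

/*
 * Hypothetical model of the dispatch order: unregister first, then the
 * flag check, then the already-registered case, and the length check
 * only for a fresh registration.
 */
static enum rseq_path rseq_dispatch(int flags, bool registered, bool length_valid)
{
	if (flags & FLAG_UNREGISTER)
		return PATH_UNREGISTER;
	if (flags & ~FLAGS_SUPPORTED)
		return PATH_EINVAL;
	if (registered)
		return PATH_REREGISTER;
	if (!length_valid)
		return PATH_EINVAL;
	return PATH_REGISTER;
}
```

One consequence visible in the model: an already registered thread never reaches the length validation, because the re-registration path returns before it.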
> #ifdef CONFIG_RSEQ_SLICE_EXTENSION
> struct slice_timer {
> struct hrtimer timer;
>