Re: [PATCH] Revert "x86/fpu: Refine and simplify the magic number check during signal return"
From: Andrei Vagin
Date: Wed Apr 29 2026 - 12:44:46 EST
On Wed, Apr 29, 2026 at 12:27 AM Chang S. Bae <chang.seok.bae@xxxxxxxxx> wrote:
>
> On 4/28/2026 5:06 PM, Andrei Vagin wrote:
> >
> > The reverted commit broke applications that construct signal frames in
> > userspace (such as CRIU and gVisor) if the frame's xstate size is
> > smaller than the kernel's fpstate->user_size.
>
> In the extended state area, the sigframe embeds the hardware-defined
> XSAVE format. If CPU A and CPU B support different XSTATE features, the
> layout (size and offsets) differ across systems. However, within a
> system, the layout is invariant. Userspace can query CPUID to obtain the
> exact offset and sizes, which effectively defines the ABI.
>
> On top of the XSAVE data, the kernel appends metadata (e.g. the xstate
> size and magic values). In particular fpstate->user_size is written by
> save_sw_bytes() at signal delivery. On sigreturn, the kernel validates
> this, which is a symmetric and straightforward check.

First of all, the reverted change broke backward compatibility for
userspace. At least two projects (gVisor and CRIU) worked correctly
before this change. With the reverted commit applied, they run into
silent memory corruption. We usually try to avoid breaking userspace
like this without strong justification.

As for layout compatibility, in most cases CPU A (older) and CPU B
(newer) have compatible XSAVE layouts: state saved on A can be
restored on B. CPU B may support new extended hardware states, but
the layout of previously supported components remains the same. CRIU
relies on this to let users migrate processes from older to newer
CPUs, and it can check whether the XSAVE layouts of two machines are
compatible.

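A minimal sketch of what such a compatibility check could look like.
This is not CRIU's actual code: struct xcomp, struct xlayout and
xlayout_compatible() are hypothetical names, and the per-component
offset and size are assumed to have been read from CPUID leaf 0xD on
each machine beforehand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_XCOMP 64

/* Hypothetical description of one XSAVE component (not a kernel or CRIU type). */
struct xcomp {
	uint32_t offset;	/* as reported by CPUID.(EAX=0xD,ECX=i):EBX */
	uint32_t size;		/* as reported by CPUID.(EAX=0xD,ECX=i):EAX */
};

/* Hypothetical description of one machine's standard-format XSAVE layout. */
struct xlayout {
	uint64_t features;		/* enabled xfeatures mask */
	struct xcomp comp[NR_XCOMP];
};

/*
 * Return true if state saved with layout src can be restored on a
 * machine with layout dst: every source feature must exist on the
 * destination, at the same offset and with the same size.
 */
static bool xlayout_compatible(const struct xlayout *src,
			       const struct xlayout *dst)
{
	assert(src && dst);

	/* A feature present on the source but missing on the destination. */
	if (src->features & ~dst->features)
		return false;

	/* Components 0 and 1 (x87, SSE) live in the fixed legacy area. */
	for (int i = 2; i < NR_XCOMP; i++) {
		if (!(src->features & (1ULL << i)))
			continue;
		if (src->comp[i].offset != dst->comp[i].offset ||
		    src->comp[i].size != dst->comp[i].size)
			return false;
	}
	return true;
}
```

The check is deliberately one-directional: extra features on the
destination do not matter, only the components the source actually
saved.
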
>
> Because the format is hardware-defined, arbitrary size mismatches should
> not be allowed. The sigframe should match the CPU-defined XSAVE layout.
> So the change in fact strengthens the sanity check.
>
> > Furthermore, this introduces a critical issue for checkpoint/restore
> > tools like CRIU. If a process is checkpointed while inside a signal
> > handler, its stack contains a signal frame formatted according to the
> > source host's xstate capabilities. If that process is later restored on
> > a destination host with larger xstate capabilities (e.g., a newer CPU
> > with more features enabled, resulting in a larger fpstate->user_size),
> > the kernel will look for FP_XSTATE_MAGIC2 at the destination host's
> > larger user_size offset instead of the offset encoded in the frame's
> > fx_sw->xstate_size. This causes the magic2 check to fail, forcing
> > sigreturn to silently fall back to "FX-only" mode.
>
> It seems that userspace could translate the XSAVE buffer from CPU A's
> format to CPU B's format during restore. If so, the frame can be
> consistent with the destination system without modifying
> fx_sw->xstate_size, and the kernel-side validation would continue to
> work as intended.

When checkpointing a process, CRIU cannot determine whether it is
currently executing within a signal handler, and it cannot find
signal frames on a user stack. In fact, there could be multiple
nested signal frames stacked on top of each other if the process
received additional signals while executing in an earlier handler.

Even if CRIU were somehow able to locate these frames, extending
them would be impossible. The target application's stack is not
under our control, and other user stack data or local variables
reside immediately after the frame.

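The offset problem described earlier in the thread can be shown with
a small userspace sketch. This is not kernel code: magic2_at() and
fake_frame_init() are hypothetical helpers, the buffer only stands in
for a user stack, and the sizes mentioned in the note below are
illustrative. FP_XSTATE_MAGIC2 is the value from the uapi
sigcontext.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FP_XSTATE_MAGIC2 0x46505845U	/* value from the uapi sigcontext.h */

/* Hypothetical helper: is MAGIC2 present at byte offset off in the buffer? */
static bool magic2_at(const uint8_t *frame, size_t frame_len, size_t off)
{
	uint32_t v;

	assert(frame);
	if (off > frame_len || frame_len - off < sizeof(v))
		return false;
	memcpy(&v, frame + off, sizeof(v));
	return v == FP_XSTATE_MAGIC2;
}

/*
 * Hypothetical helper: lay out a fake frame at the start of buf.
 * The xstate area is zeroed, MAGIC2 follows it directly, and the rest
 * of the buffer is filled with 0xAA as a stand-in for unrelated stack
 * data that sits right after the frame.
 */
static void fake_frame_init(uint8_t *buf, size_t buf_len, size_t xstate_size)
{
	uint32_t magic2 = FP_XSTATE_MAGIC2;

	assert(buf);
	assert(xstate_size <= buf_len && buf_len - xstate_size >= sizeof(magic2));
	memset(buf, 0xAA, buf_len);
	memset(buf, 0, xstate_size);
	memcpy(buf + xstate_size, &magic2, sizeof(magic2));
}
```

For a frame built with an xstate size of 832 bytes, magic2_at() finds
the marker at offset 832. A lookup at a larger host size, say 2696
bytes, lands in the unrelated stack data after the frame and fails,
and there is no room to move the marker there without overwriting
that data.
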
Thanks,
Andrei