Re: [PATCH v4 1/8] x86/fpu/xstate: Always preserve non-user xfeatures/flags in __state_perm

From: Chang S. Bae
Date: Tue Apr 01 2025 - 13:20:11 EST


On 3/18/2025 8:31 AM, Chao Gao wrote:

When granting userspace or a KVM guest access to an xfeature, preserve the
entity's existing supervisor and software-defined permissions as tracked
by __state_perm, i.e. use __state_perm to track *all* permissions even
though all supported supervisor xfeatures are granted to all FPUs and
FPU_GUEST_PERM_LOCKED disallows changing permissions.

Effectively clobbering supervisor permissions results in inconsistent
behavior, as xstate_get_group_perm() will report supervisor features for
process that do NOT request access to dynamic user xfeatures, whereas any
and all supervisor features will be absent from the set of permissions for
any process that is granted access to one or more dynamic xfeatures (which
right now means AMX).

The inconsistency isn't problematic because fpu_xstate_prctl() already
strips out everything except user xfeatures:

case ARCH_GET_XCOMP_PERM:
/*
* Lockless snapshot as it can also change right after the
* dropping the lock.
*/
permitted = xstate_get_host_group_perm();
permitted &= XFEATURE_MASK_USER_SUPPORTED;
return put_user(permitted, uptr);

case ARCH_GET_XCOMP_GUEST_PERM:
permitted = xstate_get_guest_group_perm();
permitted &= XFEATURE_MASK_USER_SUPPORTED;
return put_user(permitted, uptr);

and similarly KVM doesn't apply the __state_perm to supervisor states
(kvm_get_filtered_xcr0() incorporates xstate_get_guest_group_perm()):

case 0xd: {
u64 permitted_xcr0 = kvm_get_filtered_xcr0();
u64 permitted_xss = kvm_caps.supported_xss;

But if KVM in particular were to ever change, dropping supervisor
permissions would result in subtle bugs in KVM's reporting of supported
CPUID settings. And the above behavior also means that having supervisor
xfeatures in __state_perm is correctly handled by all users.

Dropping supervisor permissions also creates another landmine for KVM. If
more dynamic user xfeatures are ever added, requesting access to multiple
xfeatures in separate ARCH_REQ_XCOMP_GUEST_PERM calls will result in the
second invocation of __xstate_request_perm() computing the wrong ksize, as
as the mask passed to xstate_calculate_size() would not contain *any*
supervisor features.

Commit 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE
permissions") fudged around the size issue for userspace FPUs, but for
reasons unknown skipped guest FPUs. Lack of a fix for KVM "works" only
because KVM doesn't yet support virtualizing features that have supervisor
xfeatures, i.e. as of today, KVM guest FPUs will never need the relevant
xfeatures.

Simply extending the hack-a-fix for guests would temporarily solve the
ksize issue, but wouldn't address the inconsistency issue and would leave
another lurking pitfall for KVM. KVM support for virtualizing CET will
likely add CET_KERNEL as a guest-only xfeature, i.e. CET_KERNEL will not
be set in xfeatures_mask_supervisor() and would again be dropped when
granting access to dynamic xfeatures.

Note, the existing clobbering behavior is rather subtle. The @permitted
parameter to __xstate_request_perm() comes from:

permitted = xstate_get_group_perm(guest);

which is either fpu->guest_perm.__state_perm or fpu->perm.__state_perm,
where __state_perm is initialized to:

fpu->perm.__state_perm = fpu_kernel_cfg.default_features;

and copied to the guest side of things:

/* Same defaults for guests */
fpu->guest_perm = fpu->perm;

fpu_kernel_cfg.default_features contains everything except the dynamic
xfeatures, i.e. everything except XFEATURE_MASK_XTILE_DATA:

fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;

When __xstate_request_perm() restricts the local "mask" variable to
compute the user state size:

mask &= XFEATURE_MASK_USER_SUPPORTED;
usize = xstate_calculate_size(mask, false);

it subtly overwrites the target __state_perm with "mask" containing only
user xfeatures:

perm = guest ? &fpu->guest_perm : &fpu->perm;
/* Pairs with the READ_ONCE() in xstate_get_group_perm() */
WRITE_ONCE(perm->__state_perm, mask);

This changelog appears to be largely derived from Sean’s previous email. I think it can be significantly shortened to focus on the key
points, such as:

x86/fpu/xstate: Preserve non-user bits in permission handling

When granting userspace or a KVM guest access to an xfeature, the task
leader’s perm->__state_perm (host or guest) is overwritten, unintentionally discarding non-user bits. Additionally, supervisor state permissions are always granted.

The current behavior presents the following issues:

* Inconsistencies in permission handling – Supervisor permissions are
universally granted, and the FPU_GUEST_PERM_LOCKED bit is explicitly
set to prevent permission changes.

* Redundant permission setting – Since supervisor state permissions
are always granted, the permitted mask already includes them, making
it unnecessary to set them again.

Ensure that __xstate_request_perm() does not inadvertently drop
supervisor and software-defined permissions. Also, avoid redundantly
granting supervisor state permissions, and document this behavior in the
code comments.

Clarify the presence of non-user feature and flag bits in the field
description.