Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size

From: Maxim Levitsky
Date: Tue Oct 31 2023 - 13:46:10 EST


On Thu, 2023-10-26 at 10:24 -0700, Sean Christopherson wrote:
> On Wed, Oct 25, 2023, Weijiang Yang wrote:
> > On 10/25/2023 1:07 AM, Sean Christopherson wrote:
> > > On Fri, Sep 15, 2023, Weijiang Yang wrote:
> > > IIUC, the "dynamic" features contains CET_KERNEL, whereas xfeatures_mask_supervisor()
> > > conatins PASID, CET_USER, and CET_KERNEL. PASID isn't virtualized by KVM, but
> > > doesn't that mean CET_USER will get dropped/lost if userspace requests AMX/XTILE
> > > enabling?
> >
> > Yes, __state_size is correct for guest enabled xfeatures, including CET_USER,
> > and it gets removed from __state_perm.
> >
> > IIUC, from current qemu/kernel interaction for guest permission settings,
> > __xstate_request_perm() is called only _ONCE_ to set AMX/XTILE for every vCPU
> > thread, so the removal of guest supervisor xfeatures won't hurt guest! ;-/
>
> Huh? I don't follow. What does calling __xstate_request_perm() only once have
> to do with anything?
>
> /me stares more
>
> OMG, hell no. First off, this code is a nightmare to follow. The existing comment
> is useless. No shit the code is adding in supervisor states for the host. What's
> not AT ALL clear is *why*.
>
> The commit says it's necessary because the "permission bitmap is only relevant
> for user states":
>
> commit 781c64bfcb735960717d1cb45428047ff6a5030c
> Author: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Date: Thu Mar 24 14:47:14 2022 +0100
>
> x86/fpu/xstate: Handle supervisor states in XSTATE permissions
>
> The size calculation in __xstate_request_perm() fails to take supervisor
> states into account because the permission bitmap is only relevant for user
> states.
>
> But @permitted comes from:
>
> permitted = xstate_get_group_perm(guest);
>
> which is either fpu->guest_perm.__state_perm or fpu->perm.__state_perm. And
> __state_perm is initialized to:
>
> fpu->perm.__state_perm = fpu_kernel_cfg.default_features;

Not anymore after patch 5, and patch 5 does make sense in the regard to the fact
that we might not want to save/restore kernel CET state for nothing for regular kernel threads.

>
> where fpu_kernel_cfg.default_features contains everything except the dynamic
> xfeatures, i.e. everything except XFEATURE_MASK_XTILE_DATA:
>
> fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
> fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>
> So why on earth does this code to force back xfeatures_mask_supervisor()? Because
> the code just below drops the supervisor bits to compute the user xstate size and
> then clobbers __state_perm.
>
> /* Calculate the resulting user state size */
> mask &= XFEATURE_MASK_USER_SUPPORTED;
> usize = xstate_calculate_size(mask, false);
>
> ...
>
> WRITE_ONCE(perm->__state_perm, mask);
>
> That is beyond asinine. IIUC, the intent is to apply the permission bitmap only
> for user states, because the only dynamic states are user states. Bbut the above
> creates an inconsistent mess. If userspace doesn't request XTILE_DATA,
> __state_perm will contain supervisor states, but once userspace does request
> XTILE_DATA, __state_perm will be lost.
>
> And because that's not confusing enough, clobbering __state_perm would also drop
> FPU_GUEST_PERM_LOCKED, except that __xstate_request_perm() can' be reached with
> said LOCKED flag set.
>
> fpu_xstate_prctl() already strips out supervisor features:
>
> case ARCH_GET_XCOMP_PERM:
> /*
> * Lockless snapshot as it can also change right after the
> * dropping the lock.
> */
> permitted = xstate_get_host_group_perm();
> permitted &= XFEATURE_MASK_USER_SUPPORTED;
> return put_user(permitted, uptr);
>
> case ARCH_GET_XCOMP_GUEST_PERM:
> permitted = xstate_get_guest_group_perm();
> permitted &= XFEATURE_MASK_USER_SUPPORTED;
> return put_user(permitted, uptr);
>
> and while KVM doesn't apply the __state_perm to supervisor states, if it did
> there would be zero harm in doing so.
>
> case 0xd: {
> u64 permitted_xcr0 = kvm_get_filtered_xcr0();
> u64 permitted_xss = kvm_caps.supported_xss;
>
> Second, the relying on QEMU to only trigger __xstate_request_perm() is not acceptable.
> It "works" for the current code, but only because there's only a single dynamic
> feature, i.e. this will short circuit and prevent computing a bad ksize.
>
> /* Check whether fully enabled */
> if ((permitted & requested) == requested)
> return 0;
>
> I don't know how I can possibly make it any clearer: KVM absolutely must not assume
> userspace behavior.
>
> So rather than continue with the current madness, which will break if/when the
> next dynamic feature comes along, just preserve non-user xfeatures/flags in
> __guest_perm.

I more or less agree with you, however I would like to discuss the FPU permissions
in more depth:


First of all we have two things at play here:

1. On demand resize of the thread's FPU state buffer to avoid penalty of context switching the AMX state.

2. The fact that allowing this on demand resize of this state buffer breaks the x86_64 ABI,
because FPU state has to be saved on the signal stack and ABI allows the stack size to be smaller than what is
needed to save the FPU state with AMX features enabled.

Thus a two tiered approach was done: first application asks for a permission to use the dynamic features,
and then when it actually uses it, the FPU state buffer is resized.

Otherwise if an AMX instruction is used by the app but the permission was not asked by the app for its xstate component,
the application is terminated.

(I might not 100% understand this correctly, please correct me if I am wrong).

However IMHO the 'fpu permission' name is a bit misleading,
This feature is not really about security/permissions but more like opt-in to use newer ABI,
for example KVM capabilities API, and the kernel will never refuse the permission request
(except if the
signal stack size is too small but the userspace can adjust it before asking for the permission)


On top of that I think that applying the same permission approach to guest's FPU state is not a good fit,
because of two reasons:

1. The guest FPU state will never be pushed on the signal stack - KVM swaps back the host FPU state
before it returns from the KVM_RUN ioctl.

Also I think (not sure) that ptrace can only access (FPU) state of a stopped process, and a stopped vCPU process
will also first return to userspace. Again I might be mistaken here, I never researched this in depth.

Assuming that I am correct on these assumptions, the guest FPU state can only be accessed via
KVM_GET_XSAVE/KVM_SET_XSAVE/KVM_GET_XSAVE2 ioctls,
which also returns the userspace portion of the state including optionally the AMX state,
but this ioctl doesn't really need FPU permission framework, because it is a KVM ABI, and in
fact KVM_GET_XSAVE2 was added exactly because of that: to make sure that userspace
is aware that larger than 4K buffer can be returned.

2. Guest FPU state is not even on demand resized (but I can imagine that in the future we will do this).


And of course, adding permissions for kernel features, that is even worse idea, which we really
shouldn't do.

>
> If there are no objections, I'll test the below and write a proper changelog.
>
> --
> From: Sean Christopherson <seanjc@xxxxxxxxxx>
> Date: Thu, 26 Oct 2023 10:17:33 -0700
> Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
> __state_perm
>
> Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
> arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
> 1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index ef6906107c54..73f6bc00d178 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
> if ((permitted & requested) == requested)
> return 0;
>
> - /* Calculate the resulting kernel state size */
> + /*
> + * Calculate the resulting kernel state size. Note, @permitted also
> + * contains supervisor xfeatures even though supervisor are always
> + * permitted for kernel and guest FPUs, and never permitted for user
> + * FPUs.
> + */
> mask = permitted | requested;
> - /* Take supervisor states into account on the host */
> - if (!guest)
> - mask |= xfeatures_mask_supervisor();
> ksize = xstate_calculate_size(mask, compacted);

This might not work with kernel dynamic features, because xfeatures_mask_supervisor() will
return all supported supervisor features.


Therefore at least until we have an actual kernel dynamic feature
(a feature used by the host kernel and not KVM, and which has to be dynamic like AMX),
I suggest that KVM stops using the permission
API completely for the guest FPU state,
and just gives all the features it wants to enable right to __fpu_alloc_init_guest_fpstate()
(Guest FPU permission API IMHO should be deprecated and ignored)



>
> - /* Calculate the resulting user state size */
> - mask &= XFEATURE_MASK_USER_SUPPORTED;
> - usize = xstate_calculate_size(mask, false);
> + /*
> + * Calculate the resulting user state size. Take care not to clobber
> + * the supervisor xfeatures in the new mask!
> + */
> + usize = xstate_calculate_size(mask & XFEATURE_MASK_USER_SUPPORTED, false);
>
> if (!guest) {
> ret = validate_sigaltstack(usize);




Best regards,
Maxim Levitsky

>
> base-commit: c076acf10c78c0d7e1aa50670e9cc4c91e8d59b4