Re: [patch 13/31] x86/fpu: Move KVMs FPU swapping to FPU core

From: Paolo Bonzini
Date: Thu Oct 14 2021 - 05:01:38 EST


On 14/10/21 10:02, Liu, Jing2 wrote:
> > In principle I don't like it very much; it would be nicer to say "you
> > enable it for QEMU itself via arch_prctl(ARCH_SET_STATE_ENABLE), and for
> > the guests via ioctl(KVM_SET_CPUID2)". But I can see why you want to
> > keep things simple, so it's not a strong objection at all.

> Does this mean that KVM allocates 3 buffers via
> 1) QEMU's request, instead of via 2) the guest XCR0 trap?

Based on the input from Andy and Thomas, the new way would be like this:

1) host_fpu must always be checked for reallocation in kvm_load_guest_fpu (or in the FPU functions that it calls; that depends on the rest of Thomas's patches). That's because arch_prctl can enable AMX for QEMU at any point after KVM_CREATE_VCPU.

2) every use of vcpu->arch.guest_supported_xcr0 is changed to only include those dynamic-feature bits that were enabled via arch_prctl.
That is, something like:

static u64 kvm_guest_supported_xcr0(struct kvm_vcpu *vcpu)
{
	/* Static features pass through; dynamic ones need arch_prctl permission. */
	return vcpu->arch.guest_supported_xcr0 &
		(~xfeatures_mask_user_dynamic |
		 current->thread.fpu.dynamic_state_perm);
}
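
To make the masking concrete, here is a standalone userspace sketch of the same expression. The FEAT_* constants, dynamic_mask and effective_supported_xcr0() are illustrative names, not kernel symbols; I am assuming a single dynamic feature at bit 18 as a stand-in for AMX tile data.

```c
#include <stdint.h>

/* Illustrative feature bits; names are assumptions, not kernel symbols. */
#define FEAT_X87   (1ULL << 0)
#define FEAT_SSE   (1ULL << 1)
#define FEAT_AVX   (1ULL << 2)
#define FEAT_DYN   (1ULL << 18)  /* stand-in for one dynamic feature */

/* Stand-in for xfeatures_mask_user_dynamic. */
static const uint64_t dynamic_mask = FEAT_DYN;

/*
 * Same shape as the expression above: non-dynamic bits always pass
 * through, dynamic bits pass only if the permission mask has them.
 */
static uint64_t effective_supported_xcr0(uint64_t guest_supported_xcr0,
                                         uint64_t dynamic_state_perm)
{
    return guest_supported_xcr0 & (~dynamic_mask | dynamic_state_perm);
}
```

With guest_supported_xcr0 = X87|SSE|AVX|DYN, an empty permission mask yields X87|SSE|AVX, and a permission mask containing DYN yields all four bits. A permission bit for a feature that is not in guest_supported_xcr0 does not add it.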

3) Even with passthrough disabled, the guest can run with XFD set to vcpu->arch.guest_xfd (and likewise for XFD_ERR) which is much simpler than trapping #NM. The traps for writing XCR0 and XFD are used to allocate dynamic state for guest_fpu, and start the passthrough of XFD and XFD_ERR. What we need is:

- if a dynamic state has XCR0[n]=0, bit n will never be set in XFD_ERR and the state will never be dirtied by the guest.

- if a dynamic state has XCR0[n]=1, but all enabled dynamic states have XFD[n]=1, the guest is not able to dirty any dynamic XSAVE state, because they all have either XCR0[n]=0 or XFD[n]=1. An attempt to do so will cause an #NM trap and set the bit in XFD_ERR.

- if a dynamic state has XCR0[n]=1 and XFD[n]=0, the state for bit n must be allocated in guest_fpu, and KVM can also disable the vmexits for XFD and XFD_ERR.
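
The three cases can be written down as a small userspace sketch. The enum and both helpers are hypothetical names used only for illustration; the logic is just the XCR0/XFD bit tests described in the list.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the three cases listed above. */
enum dyn_state_case {
    DYN_DISABLED,  /* XCR0[n]=0: never dirtied, never reported in XFD_ERR */
    DYN_ARMED,     /* XCR0[n]=1, XFD[n]=1: first use raises #NM */
    DYN_LIVE,      /* XCR0[n]=1, XFD[n]=0: buffer space is required */
};

/* Classify dynamic state n; the caller must pass n < 64. */
static enum dyn_state_case classify_dynamic_state(uint64_t xcr0, uint64_t xfd,
                                                  unsigned int n)
{
    uint64_t bit = 1ULL << n;

    if (!(xcr0 & bit))
        return DYN_DISABLED;
    if (xfd & bit)
        return DYN_ARMED;
    return DYN_LIVE;
}

/* True if at least one dynamic state has XCR0[n]=1 and XFD[n]=0. */
static bool guest_can_dirty_dynamic_state(uint64_t xcr0, uint64_t xfd,
                                          uint64_t dynamic_mask)
{
    return (xcr0 & ~xfd & dynamic_mask) != 0;
}
```

The second helper is the xcr0 & ~xfd test restricted to dynamic bits: as long as it returns false, no dynamic state needs to be allocated yet.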

Therefore:

- if passthrough is disabled, the XCR0 and XFD write traps can check guest_xcr0 & ~guest_xfd. If it includes a dynamic state bit, dynamic state is allocated for all bits enabled in guest_xcr0 and passthrough is started; this should happen shortly after the guest gets its first #NM trap for AMX.

- if passthrough is enabled, the XCR0 write trap must still ensure that dynamic state is allocated for all bits enabled in guest_xcr0.

So something like this pseudocode is called by both XCR0 and XFD writes:

int kvm_alloc_fpu_dynamic_features(struct kvm_vcpu *vcpu)
{
	u64 allowed_dynamic = current->thread.fpu.dynamic_state_perm;
	u64 enabled_dynamic =
		vcpu->arch.xcr0 & xfeatures_mask_user_dynamic;

	/* All dynamic features have to be arch_prctl'd first. */
	WARN_ON_ONCE(enabled_dynamic & ~allowed_dynamic);

	if (!vcpu->arch.xfd_passthrough) {
		/* All dynamic states will #NM? Wait and see. */
		if ((enabled_dynamic & ~vcpu->arch.xfd) == 0)
			return 0;

		kvm_x86_ops.enable_xfd_passthrough(vcpu);
	}

	/* current->thread.fpu was already handled by arch_prctl. */
	return fpu_alloc_features(vcpu->guest_fpu,
		vcpu->guest_fpu.dynamic_state_perm | enabled_dynamic);
}
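
For anyone who wants to trace the decisions outside the kernel, the same control flow can be modelled in userspace roughly as follows. struct vcpu_model and model_alloc_dynamic_features() are made-up names, the allocation is reduced to setting bits in a mask, and the model returns -1 where the pseudocode would only WARN; treat it as a sketch of the flow, not of the real allocation path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the vCPU state that the pseudocode looks at. */
struct vcpu_model {
    uint64_t xcr0;         /* guest XCR0 */
    uint64_t xfd;          /* guest XFD */
    uint64_t allocated;    /* dynamic features that have buffer space */
    bool xfd_passthrough;  /* XFD/XFD_ERR no longer trapped */
};

/*
 * Called after every modelled XCR0 or XFD write.  Returns 0 on success
 * and -1 if the guest enabled a dynamic feature without permission
 * (the pseudocode warns in that case; failing is a choice of this model).
 */
static int model_alloc_dynamic_features(struct vcpu_model *v,
                                        uint64_t dynamic_mask,
                                        uint64_t allowed_dynamic)
{
    uint64_t enabled_dynamic = v->xcr0 & dynamic_mask;

    if (enabled_dynamic & ~allowed_dynamic)
        return -1;

    if (!v->xfd_passthrough) {
        /* Every enabled dynamic state is still armed in XFD: wait. */
        if ((enabled_dynamic & ~v->xfd) == 0)
            return 0;

        v->xfd_passthrough = true;
    }

    /* Stand-in for the real allocation: just record the feature bits. */
    v->allocated |= enabled_dynamic;
    return 0;
}
```

Starting with xcr0 containing the dynamic bit and xfd set for it, the call leaves passthrough off and allocates nothing. Once the guest clears the bit in xfd, the next call turns passthrough on and records the allocation.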

Paolo