Thoughts of AMX KVM support based on latest kernel
From: Liu, Jing2
Date: Wed Nov 10 2021 - 08:01:28 EST
Hi Thomas and Paolo,
Thanks for your thoughts and suggestions. After reading the emails
and looking at the code, we'd like to explain our thoughts on AMX
KVM support, based on the latest kernel and the code from git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/fpu-kvm
AMX support based on existing design concepts
One of our objectives is to have a simple and clean KVM implementation by
utilizing the new dynamic extended-features handling in the FPU core.
Dynamic reallocation and "lazy passthrough"
The new code allows us to implement "lazy passthrough" of the XFD MSRs
by coupling it with a buffer reallocation request, which a vcpu makes
indirectly (via a VM exit). With "lazy passthrough" of the XFD MSR, we can
avoid unnecessary save/restore of the MSR and allocation of the extended
features until the guest really requires them and is allowed to use them.
Until that point, the XFD MSR is virtual, and thus we do not need to
save/restore the actual MSR at VM entry/exit time. The vcpu also does not
have an extended state until that point.
Once the guest starts using the XFD feature (e.g. AMX) and it is permitted to
use it, we allow the guest to directly modify the MSR (passthrough) to avoid
(potentially frequent) VM exits.
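
To make the "virtual until needed" idea concrete, here is a minimal
user-space sketch in C. It is only a model under stated assumptions: the
struct and function names are hypothetical and are not taken from KVM or
the FPU core. It shows that MSR save/restore work at VM entry/exit is
only incurred after passthrough has been enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vcpu bookkeeping; not a real KVM structure. */
struct vcpu_xfd_model {
    uint64_t virt_xfd;    /* value the guest believes is in the XFD MSR */
    bool     passthrough; /* true once guest writes are no longer intercepted */
    uint64_t msr_swaps;   /* number of hardware MSR save/restore pairs */
};

/* Intercepted WRMSR(XFD): only the virtual copy is updated. */
static void model_wrmsr_xfd_intercepted(struct vcpu_xfd_model *v, uint64_t val)
{
    assert(v != NULL);
    v->virt_xfd = val;
}

/* Called once the guest needs, and is permitted to use, a dynamic feature. */
static void model_enable_passthrough(struct vcpu_xfd_model *v)
{
    assert(v != NULL);
    v->passthrough = true;
}

/* One VM exit/entry pair: the real MSR is swapped only in passthrough mode. */
static void model_exit_entry(struct vcpu_xfd_model *v)
{
    assert(v != NULL);
    if (v->passthrough)
        v->msr_swaps++;
}
```

In this model, any number of exit/entry pairs before
model_enable_passthrough() leaves msr_swaps at zero.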
Triggering of a reallocation request and error handling
First, we want to avoid weird guest failures at runtime due to (more likely)
permission failures of a reallocation request. To do so, we check the
permissions of the vcpu (for the extended features) at
kvm_vcpu_ioctl_set_cpuid2() time, when QEMU wants to advertise the extended
features (e.g. AMX) for the first time. At vcpu_create() time we have no
idea whether QEMU wants to enable AMX or not. If kvm_vcpu_ioctl_set_cpuid2()
succeeds, then there is no need to check permission again in the
reallocation path.
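
The early permission check described above could look roughly like the
following user-space sketch in C. This is an assumption-laden model, not
kernel code: struct model_vm, model_check_cpuid_xfeatures() and the choice
of -EPERM as the error code are all hypothetical. The only hardware fact
used is that AMX tile data is XSAVE state component 18.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define XTILE_DATA             (1ULL << 18) /* AMX tile data, component 18 */
#define MODEL_DYNAMIC_FEATURES XTILE_DATA   /* features guarded by XFD */

/* Hypothetical record of which dynamic features the VMM process may use. */
struct model_vm {
    uint64_t permitted;
};

/*
 * Model of the check at set_cpuid2 time: fail early if userspace
 * advertises a dynamic feature that it has no permission for.
 * Non-dynamic features are not subject to this check.
 */
static int model_check_cpuid_xfeatures(const struct model_vm *vm,
                                       uint64_t advertised)
{
    uint64_t dynamic = advertised & MODEL_DYNAMIC_FEATURES;

    assert(vm != NULL);
    if (dynamic & ~vm->permitted)
        return -EPERM; /* hypothetical error code */
    return 0;
}
```

Because the failure is reported before the guest ever runs, the later
reallocation path in this model never has to re-check permission.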
Upon detection (interception) of an attempt by a vcpu to write to XCR0
(XSETBV) or XFD (WRMSR), we check if the write is valid, and we start
passthrough of the XFD MSRs if the dynamic feature[i] meets the condition
XCR0[i]=1 && XFD[i]=0. We then make a reallocation request to the FPU core.
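
The trigger condition can be written down as a small predicate. The
following is a minimal sketch in C; the function name is hypothetical, and
the example values assume AMX tile data at state-component bit 18 and tile
configuration at bit 17.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if feature i is enabled in XCR0 and not disarmed by XFD,
 * i.e. XCR0[i]=1 && XFD[i]=0. In the scheme described above, this is the
 * point where passthrough starts and a larger buffer is requested.
 */
static bool feature_needs_buffer(uint64_t xcr0, uint64_t xfd, unsigned int i)
{
    assert(i < 64);
    return ((xcr0 >> i) & 1) && !((xfd >> i) & 1);
}
```

For example, with XCR0 = 0x60007 (x87, SSE, AVX, tile config, tile data)
the predicate for i = 18 is false while XFD bit 18 is set, and becomes true
once the guest clears that bit.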
We simplify the KVM implementation by assuming that the reallocation
request was successful when the vcpu comes back to KVM. For such VM exit
handling that requires a buffer-reallocation request, we don't resume the
guest immediately. Instead, we go back to the userspace, to rely on the
userspace VMM (e.g. QEMU) for handling error cases. The actual reallocation
happens when control is transferred from KVM to the kernel (FPU core). If
there is no error, QEMU will come back to KVM by repeating vcpu_ioctl_run().
Potential failures there are due to lack of memory. These are not
interesting cases, though; if that happens, the host most likely has more
serious resource problems at that time.
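
The control flow above (record the request, leave to userspace, reallocate
on the way out, re-enter on success) can be modeled as follows. This is a
user-space sketch in C with hypothetical names; the 1/0 return values of
model_handle_exit() are just this sketch's convention for "resume the
guest" versus "go back to userspace", and alloc_ok stands in for whether
the host has memory available.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical vcpu state for this model only. */
struct vcpu_model {
    bool   realloc_pending; /* set by the exit handler */
    size_t buf_size;        /* current size of the FPU state buffer */
};

/* KVM side: returns 1 to resume the guest, 0 to go back to userspace. */
static int model_handle_exit(struct vcpu_model *v, bool needs_larger_buffer)
{
    assert(v != NULL);
    if (!needs_larger_buffer)
        return 1;
    v->realloc_pending = true;
    return 0;
}

/* FPU-core side: performs the pending reallocation on the way out. */
static int model_return_to_user(struct vcpu_model *v, size_t new_size,
                                bool alloc_ok)
{
    assert(v != NULL);
    if (!v->realloc_pending)
        return 0;
    if (!alloc_ok)
        return -ENOMEM; /* reported to the VMM, which decides what to do */
    v->buf_size = new_size;
    v->realloc_pending = false;
    return 0;
}
```

When model_return_to_user() returns 0, the VMM in this model simply runs
the vcpu again, and KVM can assume the larger buffer is in place.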
Additional KVM-specific and virtualization requirements
KVM needs to virtualize the XFD features, and we have additional
requirements.
XFD reset value
The XFD reset value needs to be 0.
KVM-specific XFD handling in XSAVES/XRSTORS
Once we start passing through the XFD MSRs, we need to save/restore
them at VM exit/entry time. If we immediately resume the guest
without enabling interrupts/preemption (exit fast-path), we have no
issues. We don't need to save the MSR. The question is how the host
XFD MSR is restored while control is in KVM.
The XSAVE(S) instruction saves the (guest) state component[x] as 0 or
doesn't save when XFD[x] != 0. Accordingly, XRSTOR(S) cannot restore
that (guest state). It is possible that XFD != 0 and the guest is using
an extended feature at VM exit; we can check the XINUSE state-component
bitmap with XGETBV(1). By adding more meaning to the existing field
fpstate->in_use, KVM could use it to record the XINUSE value.
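
The problematic combination can be expressed as a predicate over the two
bitmaps. The sketch below is a user-space model in C with a hypothetical
function name; xinuse stands for the bitmap that XGETBV(1) would return,
and the behaviour modeled is the one described above (a component with
XFD[x] != 0 is not saved).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if state component x is in use (XINUSE[x]=1) while being
 * disarmed (XFD[x]=1). In that case a save would not capture the live
 * register contents, so they could be lost.
 */
static bool model_component_at_risk(uint64_t xinuse, uint64_t xfd,
                                    unsigned int x)
{
    assert(x < 64);
    return ((xinuse >> x) & 1) && ((xfd >> x) & 1);
}
```

For AMX tile data (x = 18), the guest state is only at risk when both bits
are set; if either XINUSE[18] or XFD[18] is clear, nothing is lost.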
The usual VM exit handling in KVM, however, is done with
interrupt/preemption enabled. If a guest has a non-zero XFD and AMX
is in use at VM exit, the host and KVM need to maintain the guest state.
There are two cases where the host and KVM may lose the state:
a). KVM is scheduled out and kernel context switch does XSAVES,
b). KVM is interrupted and the softirq path calls
kernel_fpu_begin_mask(), which may execute XSAVES.
One crude way (Option 1) would be to clear XFD temporarily at VM exit
time if the extended feature (AMX) is in use (XINUSE). This causes
unnecessary overhead, because interrupt/preemption may not always
happen.
Given the new unified handling of the XFD state management and
guest awareness in the FPU core, we think it might be better to defer
this to the host (Option 2):
a). Before the host kernel executes XSAVES, it clears XFD by checking if
this is a KVM guest fpu and if guest AMX is in use (XINUSE). KVM can
convey the condition by using fpstate->is_guest and fpstate->in_use,
for example. We need to add more meaning (and code changes) to
those fields.
b). Same for XRSTORS.
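
A minimal sketch of Option 2, as a user-space model in C. The struct below
only mirrors the two fields mentioned above (is_guest and in_use); it is
not the kernel's struct fpstate, and both function names are hypothetical.
The save model follows the behaviour described earlier: a component is
written only if it is in use and not disarmed by XFD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XTILE_DATA (1ULL << 18) /* AMX tile data, state component 18 */

/* Hypothetical stand-in for the fields discussed above. */
struct fpstate_model {
    bool     is_guest; /* this is a KVM guest fpu */
    bool     in_use;   /* guest AMX state is live (XINUSE) */
    uint64_t xfd;      /* XFD value associated with this state */
};

/* XFD value the host would program right before XSAVES/XRSTORS. */
static uint64_t model_xfd_for_save(const struct fpstate_model *fps)
{
    assert(fps != NULL);
    if (fps->is_guest && fps->in_use)
        return 0; /* clear XFD so the live guest state is captured */
    return fps->xfd;
}

/* Components a modeled save would write for a given XINUSE and XFD. */
static uint64_t model_components_saved(uint64_t xinuse, uint64_t xfd)
{
    return xinuse & ~xfd;
}
```

With a guest fpstate that has XFD bit 18 set and tile data in use, saving
with the raw XFD value drops component 18, while saving with the value
from model_xfd_for_save() keeps it. For a non-guest fpstate the XFD value
is left untouched.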
One potential drawback of Option 2 might be additional
checks in the host, although we can minimize the impact by having
CONFIG_KVM_TBD. We believe that the case
"XFD != 0 and XINUSE != 0" should be very infrequent.
Propagation of reallocation errors
As noted above, a reallocation request can fail, and we need to
propagate the error code to the userspace (e.g. QEMU) so that
it can handle the failure properly. Since we do not want to
terminate the guest after it has started running due to permission
errors ("weird failure"), we think we should check the permission at
set_cpuid2 time and return failure if there is no permission.
Looking forward to your comments.
Thanks,
Jing