Re: [PATCH v5 03/26] x86/hyperv: Update 'struct hv_enlightened_vmcs' definition
From: Sean Christopherson
Date: Mon Aug 22 2022 - 14:33:48 EST
On Mon, Aug 22, 2022, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
>
> > On Mon, Aug 22, 2022, Vitaly Kuznetsov wrote:
> >> My initial implementation was inventing 'eVMCS revision' concept:
> >> https://lore.kernel.org/kvm/20220629150625.238286-7-vkuznets@xxxxxxxxxx/
> >>
> >> which is needed if we don't gate all these new fields on CPUID.0x4000000A.EBX BIT(0).
> >>
> >> Going forward, we will still (likely) need something when new fields show up.
> >
> > My comments from that thread still apply. Adding "revisions" or feature flags
> > isn't maintanable, e.g. at best KVM will end up with a ridiculous number of flags.
> >
> > Looking at QEMU, which I strongly suspect is the only VMM that enables
> > KVM_CAP_HYPERV_ENLIGHTENED_VMCS, it does the sane thing of enabling the capability
> > before grabbing the VMX MSRs.
> >
> > So, why not simply apply filtering for host accesses as well?
>
> (I understand that using QEMU to justify KVM's behavior is flawed but...)
>
> QEMU's migration depends on the assumption that identical QEMU's command
> lines create identical (from guest PoV) configurations. Assume we have
> (simplified)
>
> "-cpu CascadeLake-Sever,hv-evmcs"
>
> on both source and destination but source host is newer, i.e. its KVM
> knows about TSC Scaling in eVMCS and destination host has no idea about
> it. If we just apply filtering upon vCPU creation, guest visible MSR
> values are going to be different, right? Ok, assuming QEMU also migrates
> VMX feature MSRs (TODO: check if that's true), we will be able to fail
> mirgration late (which is already much worse than not being able to
> create the desired configuration on destination, 'fail early') if we use
> in-KVM filtering to throw an error to userspace. But if we blindly
> filter control MSRs on the destination, 'TscScaling' will just disapper
> undreneath the guest. This is unlikely to work.
But all of that holds true irrespetive of eVMCS. If QEMU attempts to migrate a
nested guest from a KVM that supports TSC_SCALING to a KVM that doesn't support
TSC_SCALING, then TSC_SCALING is going to disappear and VM-Entry on the dest will
fail. I.e. QEMU _must_ be able to detect the incompatibility and not attempt
the migration. And with that code in place, QEMU doesn't need to do anything new
for eVMCS, it Just Works.
> In any case, what we need, is an option for VMM (read: QEMU) to create
> the configuration with 'TscScaling' filtered out even KVM supports the
> bit in eVMCS. This way the guest will be able to migrate backwards to an
> older KVM which doesn't support it, i.e.
>
> '-cpu CascadeLake-Sever,hv-evmcs'
> creates the 'origin' eVMCS configuration, no TscScaling
>
> '-cpu CascadeLake-Sever,hv-evmcs,hv-evmcs-2022' creates the updated one.
Again, this conundrum exists irrespective of eVMCS. Properly solve the problem
for regular nVMX and eVMCS should naturally work.
> KVM_CAP_HYPERV_ENLIGHTENED_VMCS is bad as it only takes 'eVMCS' version
> as a parameter (as we assumed it will always change when new fields are
> added, but that turned out to be false). That's why I suggested
> KVM_CAP_HYPERV_ENLIGHTENED_VMCS2.
Enumerating features via versions is such a bad API though, e.g. if there's a
bug with nested TSC_SCALING, userspace can't disable just nested TSC_SCALING
without everything else under the inscrutable "hv-evmcs-2022" being collateral
damage.