Re: [PATCH v5 03/26] x86/hyperv: Update 'struct hv_enlightened_vmcs' definition
From: Vitaly Kuznetsov
Date: Tue Aug 23 2022 - 03:33:18 EST
Sean Christopherson <seanjc@xxxxxxxxxx> writes:
> On Mon, Aug 22, 2022, Vitaly Kuznetsov wrote:
>> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
>>
>> > On Mon, Aug 22, 2022, Vitaly Kuznetsov wrote:
>> >> My initial implementation was inventing 'eVMCS revision' concept:
>> >> https://lore.kernel.org/kvm/20220629150625.238286-7-vkuznets@xxxxxxxxxx/
>> >>
>> >> which is needed if we don't gate all these new fields on CPUID.0x4000000A.EBX BIT(0).
>> >>
>> >> Going forward, we will still (likely) need something when new fields show up.
>> >
>> > My comments from that thread still apply. Adding "revisions" or feature flags
>> > isn't maintanable, e.g. at best KVM will end up with a ridiculous number of flags.
>> >
>> > Looking at QEMU, which I strongly suspect is the only VMM that enables
>> > KVM_CAP_HYPERV_ENLIGHTENED_VMCS, it does the sane thing of enabling the capability
>> > before grabbing the VMX MSRs.
>> >
>> > So, why not simply apply filtering for host accesses as well?
>>
>> (I understand that using QEMU to justify KVM's behavior is flawed but...)
>>
>> QEMU's migration depends on the assumption that identical QEMU's command
>> lines create identical (from guest PoV) configurations. Assume we have
>> (simplified)
>>
>> "-cpu CascadeLake-Sever,hv-evmcs"
>>
>> on both source and destination but source host is newer, i.e. its KVM
>> knows about TSC Scaling in eVMCS and destination host has no idea about
>> it. If we just apply filtering upon vCPU creation, guest visible MSR
>> values are going to be different, right? Ok, assuming QEMU also migrates
>> VMX feature MSRs (TODO: check if that's true), we will be able to fail
>> mirgration late (which is already much worse than not being able to
>> create the desired configuration on destination, 'fail early') if we use
>> in-KVM filtering to throw an error to userspace. But if we blindly
>> filter control MSRs on the destination, 'TscScaling' will just disapper
>> undreneath the guest. This is unlikely to work.
>
> But all of that holds true irrespetive of eVMCS. If QEMU attempts to migrate a
> nested guest from a KVM that supports TSC_SCALING to a KVM that doesn't support
> TSC_SCALING, then TSC_SCALING is going to disappear and VM-Entry on the dest will
> fail. I.e. QEMU _must_ be able to detect the incompatibility and not attempt
> the migration. And with that code in place, QEMU doesn't need to do anything new
> for eVMCS, it Just Works.
I'm obviously missing something. "-cpu CascadeLake-Sever" presumes
cetain features, including VMX features (e.g. TSC_SCALING), an attempt
to create such vCPU on a CPU which doesn't support it will lead to
immediate failure. So two VMs created on different hosts with
-cpu CascadeLake-Sever
are guaranteed to look exactly the same from guest PoV. This is not true
for '-cpu CascadeLake-Sever,hv-evmcs' (if we do it the way you suggest)
as 'hv-evmcs' will be a *different* filter on each host (which is going
to depend on KVM version, not even on the host's hardware).
>
>> In any case, what we need, is an option for VMM (read: QEMU) to create
>> the configuration with 'TscScaling' filtered out even KVM supports the
>> bit in eVMCS. This way the guest will be able to migrate backwards to an
>> older KVM which doesn't support it, i.e.
>>
>> '-cpu CascadeLake-Sever,hv-evmcs'
>> creates the 'origin' eVMCS configuration, no TscScaling
>>
>> '-cpu CascadeLake-Sever,hv-evmcs,hv-evmcs-2022' creates the updated one.
>
> Again, this conundrum exists irrespective of eVMCS. Properly solve the problem
> for regular nVMX and eVMCS should naturally work.
I don't think we have this problem for VMX features as named CPU models
in QEMU encode all of them explicitly, they *must* be present whenever
such vCPU is created.
>
>> KVM_CAP_HYPERV_ENLIGHTENED_VMCS is bad as it only takes 'eVMCS' version
>> as a parameter (as we assumed it will always change when new fields are
>> added, but that turned out to be false). That's why I suggested
>> KVM_CAP_HYPERV_ENLIGHTENED_VMCS2.
>
> Enumerating features via versions is such a bad API though, e.g. if there's a
> bug with nested TSC_SCALING, userspace can't disable just nested TSC_SCALING
> without everything else under the inscrutable "hv-evmcs-2022" being collateral
> damage.
Why? Something like
"-cpu CascadeLake-Sever,hv-evmcs,hv-evmcs-2022,-vmx-tsc-scaling"
should work well, no? 'hv-evmcs*' are just filters, if the VMX feature
is not there -- it's not there.
We can certainly make this fine-grained and introduce
KVM_CAP_HYPERV_EVMCS_TSC_SCALING instead and make a 'hv-*' flag for it,
however, with Hyper-V and Windows I really don't like 'frankenstein'
Hyper-V configurations which look nothing like genuine Hyper-V as nobody
at Microsoft tests new Windows builds with such configurations. And yes,
bugs were discovered in the past (e.g. PerfGlobalCtrl and Win11).
--
Vitaly