RE: [PATCH] KVM: SVM: Disable TDP MMU when running on Hyper-V

From: Alexander Grest
Date: Sun Mar 12 2023 - 13:42:29 EST


Yes, I agree that bumping the ASID doesn't flush Hyper-V's shadow page tables (if the guest opts into "enlightened TLB").

Here is the relevant section from the TLFS:

"The nested hypervisor can optionally opt into an "enlightened TLB" by setting EnlightenedNptTlb to "1" in HV_SVM_ENLIGHTENED_VMCB_FIELDS. If the nested hypervisor opts into the enlightenment, ASID invalidations just flush TLB entries derived from first level address translation (i.e. the virtual address space). To flush TLB entries derived from the nested page table (NPT) and force the L0 hypervisor to rebuild shadow page tables, the HvCallFlushGuestPhysicalAddressSpace or HvCallFlushGuestPhysicalAddressList hypercalls must be used."

(see https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/nested-virtualization)

Thanks,
Alex

-----Original Message-----
From: Jeremi Piotrowski <jpiotrowski@xxxxxxxxxxxxxxxxxxx>
Sent: Thursday, March 9, 2023 9:59 AM
To: Alexander Grest <Alexander.Grest@xxxxxxxxxxxxx>
Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx; Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>; Tianyu Lan <ltykernel@xxxxxxxxx>; Michael Kelley (LINUX) <mikelley@xxxxxxxxxxxxx>; Sean Christopherson <seanjc@xxxxxxxxxx>
Subject: Re: [PATCH] KVM: SVM: Disable TDP MMU when running on Hyper-V

@Alex,
do you know the answer to Sean's question below about ASID handling in Hyper-V?
Any other comments about how we're intending to fix things are also welcome.

Jeremi

On 08/03/2023 20:11, Sean Christopherson wrote:
> On Wed, Mar 08, 2023, Jeremi Piotrowski wrote:
>> On 08/03/2023 01:39, Sean Christopherson wrote:
>>> On Wed, Mar 08, 2023, Paolo Bonzini wrote:
>>>> On Tue, Mar 7, 2023 at 6:36 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>>>>> Thinking about this more, I would rather revert commit 1e0c7d40758b ("KVM: SVM:
>>>>> hyper-v: Remote TLB flush for SVM") or fix the thing properly
>>>>> straitaway. KVM doesn't magically handle the flushes correctly
>>>>> for the shadow/legacy MMU, KVM just happens to get lucky and not run afoul of the underlying bugs.
>>>>
>>>> I don't think it's about luck---the legacy MMU's
>>>> zapping/invalidation seems to invoke the flush hypercall correctly:
>>>
>>> ...for the paths that Jeremi has exercised, and for which a stale
>>> TLB entry is fatal to L2. E.g. kvm_unmap_gfn_range() does not have
>>> a range-based TLB flush in its path and fully relies on the buggy kvm_flush_remote_tlbs().
>>>
>>
>> Why do you say "buggy kvm_flush_remote_tlbs"? kvm_flush_remote_tlbs
>> calls the hypercall that is needed, I don't see how this might be an
>> issue of a missing "range-based TLB flush".
>
> Doh, I forgot that the arch hook in kvm_flush_remote_tlbs() leads to
> the Hyper-V hook.
>
> svm_flush_tlb_current is very much broken, but in practice it doesn't
> matter outside of the direct call from kvm_mmu_load(), because in all
> other paths KVM will flow through a Hyper-V flush if KVM actually
> modifies its MMU in any ways. E.g. the request from kvm_mmu_new_pgd()
> when force_flush_and_sync_on_reuse=true is neutered, but that exists
> only as a safeguard against MMU bugs. And for things like
> kvm_invalidate_pcid() and kvm_post_set_cr4(), my understanding is that
> Hyper-V will still flush the bare metal TLB, it's only Hyper-V's
> shadow page tables that are stale.
>
> Depending on how Hyper-V handles ASIDs, pre_svm_run() may also be
> broken. If Hyper-V tracks and rebuilds only the current ASID, then
> bumping the ASID won't rebuild potentially stale page tables. But I'm
> guessing Hyper-V ignores the ASID since the hypercall takes only the root PA.
>
> The truly problematic case is kvm_mmu_load(), where KVM relies on the
> flush to force Hyper-V to rebuild shadow page tables for an old,
> possibly stale nCR3. This affects only the TDP MMU because of an
> explicit optimization in the TDP MMU. So in practice we could squeak by with something like this:
>
> if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb)
> hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa);
> else
> static_call(kvm_x86_flush_tlb_current)(vcpu);
>
> but I'm not convinced that avoiding a hypercall in
> svm_flush_tlb_current() just to avoid overhead when running an L3
> (nested VM from L1 KVM's perspective) is worthwhile. The real problem
> there is that KVM nested SVM TLB/ASID support is an unoptimized mess,
> and I can't imagine someone running an L3 is going to be super concerned with performance.
>
> I also think we should have a sanity check in the flush_tlb_all()
> path, i.e. WARN if kvm_flush_remote_tlbs() falls back.
>
> Something like this (probably doesn't compile, likely needs #ifdefs or helpers):
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index
> 7ef4f9e3b35a..38afc5cac1c4 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3770,7 +3770,7 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
> svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF); }
>
> -static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
> +static void svm_flush_tlb_asid(struct kvm_vcpu *vcpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
>
> @@ -3794,6 +3794,23 @@ static void svm_flush_tlb_current(struct kvm_vcpu *vcpu)
> svm->current_vmcb->asid_generation--;
> }
>
> +static void svm_flush_tlb_current(struct kvm_vcpu *vcpu) {
> + if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb &&
> + VALID_PAGE(vcpu->arch.mmu->root.hpa))
> + hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa);
> +
> + svm_flush_tlb_asid(vcpu);
> +}
> +
> +static void svm_flush_tlb_all(struct kvm_vcpu *vcpu) {
> + if (WARN_ON_ONCE(kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb))
> + hv_remote_flush_tlb(vcpu->kvm);
> +
> + svm_flush_tlb_asid(vcpu);
> +}
> +
> static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva) {
> struct vcpu_svm *svm = to_svm(vcpu); @@ -4786,10 +4803,10 @@
> static struct kvm_x86_ops svm_x86_ops __initdata = {
> .set_rflags = svm_set_rflags,
> .get_if_flag = svm_get_if_flag,
>
> - .flush_tlb_all = svm_flush_tlb_current,
> + .flush_tlb_all = svm_flush_tlb_all,
> .flush_tlb_current = svm_flush_tlb_current,
> .flush_tlb_gva = svm_flush_tlb_gva,
> - .flush_tlb_guest = svm_flush_tlb_current,
> + .flush_tlb_guest = svm_flush_tlb_asid,
>
> .vcpu_pre_run = svm_vcpu_pre_run,
> .vcpu_run = svm_vcpu_run,