Re: [PATCH] KVM: SVM: Disable TDP MMU when running on Hyper-V

From: Vitaly Kuznetsov
Date: Tue Mar 07 2023 - 05:08:37 EST


Jeremi Piotrowski <jpiotrowski@xxxxxxxxxxxxxxxxxxx> writes:

> On 06/03/2023 18:52, Vitaly Kuznetsov wrote:
>> Jeremi Piotrowski <jpiotrowski@xxxxxxxxxxxxxxxxxxx> writes:
>>
>>> TDP MMU has been broken on AMD CPUs when running on Hyper-V since v5.17.
>>> The issue was first introduced by two commmits:
>>>
>>> - bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB
>>> flush to caller when freeing TDP MMU shadow pages"
>>> - efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct
>>> roots via asynchronous worker"
>>>
>>> The root cause is that since then there are missing TLB flushes which
>>> are required by HV_X64_NESTED_ENLIGHTENED_TLB.
>>
>> Please share more details on what's actually missing as you get them,
>> I'd like to understand which flushes can be legally avoided on bare
>> hardware and Hyper-V/VMX but not on Hyper-V/SVM.
>>
>
> See the linked thread here
> https://lore.kernel.org/lkml/20d189fc-8d20-8083-b448-460cc0420151@xxxxxxxxxxxxxxxxxxx/#t
> for all the details/analyses but the summary was that either of these 2
> options would work, with a) having less flushes (footnote: less flushes is not necessarily
> better):
>
> a) adding a hyperv_flush_guest_mapping(__pa(root->spt) after kvm_tdp_mmu_get_vcpu_root_hpa's call to tdp_mmu_alloc_sp()
> b) adding a hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa) to svm_flush_tlb_current()
>
> These are only needed on Hyper-V/SVM because of how the enlightenment works (needs an explicit
> flush to rebuild L0 shadow page tables). Hyper-V/VMX does not need any changes and currently
> works. Let me know if you need more information on something here, I'll try to get it.
>

Ah, I missed the whole party! Thanks for the pointers!

>>> The failure manifests
>>> as L2 guest VMs being unable to complete boot due to memory
>>> inconsistencies between L1 and L2 guests which lead to various
>>> assertion/emulation failures.

Which levels are we talking about here, *real* L1 and L2 or L1 and L2
from KVM's perspective (real L2 and L3)?

>>>
>>> The HV_X64_NESTED_ENLIGHTENED_TLB enlightenment is always exposed by
>>> Hyper-V on AMD and is always used by Linux. The TLB flush required by
>>> HV_X64_NESTED_ENLIGHTENED_TLB is much stricter than the local TLB flush
>>> that TDP MMU wants to issue. We have also found that with TDP MMU L2 guest
>>> boot performance on AMD is reproducibly slower compared to when TDP MMU is
>>> disabled.
>>>
>>> Disable TDP MMU when using SVM Hyper-V for the time being while we
>>> search for a better fix.
>>
>> I'd suggest we go the other way around: disable
>> HV_X64_NESTED_ENLIGHTENED_TLB on SVM:
>
> Paolo suggested disabling TDP_MMU when HV_X64_NESTED_ENLIGHTENED_TLB is used, and
> I prefer that option too. The enlighenment does offer a nice performance advantage
> with non-TDP_MMU, and I did not see TDP_MMU perform any better compared to that.
> Afaik the code to use the enlightenment on Hyper-V/SVM was written/tested before
> TDP_MMU became the default.
>
> If you have a specific scenario in mind, we could test and see what the implications
> are there.

I don't have a strong opinion here, I've suggested a smaller change so
it's easier to backport it to stable kernels and easier to revert when a
proper fix comes to mainline. For performance implication, I'd only
consider non-nested scenarios from KVM's perspective (i.e. real L2 from
Hyper-V's PoV), as running L3 is unlikely a common use-case and, if I
understood correctly, is broken anyway.

>
>>
>> diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
>> index 6981c1e9a809..be98da5a4277 100644
>> --- a/arch/x86/kvm/svm/svm_onhyperv.h
>> +++ b/arch/x86/kvm/svm/svm_onhyperv.h
>> @@ -32,7 +32,8 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>>
>> static inline void svm_hv_hardware_setup(void)
>> {
>> - if (npt_enabled &&
>> + /* A comment about missing TLB flushes */
>> + if (!tdp_mmu_enabled && npt_enabled &&
>> ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB) {
>> pr_info(KBUILD_MODNAME ": Hyper-V enlightened NPT TLB flush enabled\n");
>> svm_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
>>
>> this way we won't have a not-obvious-at-all MMU change on Hyper-V. I
>> understand this may have some performance implications but MMU switch
>> has some as well.
>>
>>>
>>> Link: https://lore.kernel.org/lkml/43980946-7bbf-dcef-7e40-af904c456250@xxxxxxxxxxxxxxxxxxx/t/#u
>>> Signed-off-by: Jeremi Piotrowski <jpiotrowski@xxxxxxxxxxxxxxxxxxx>
>>> ---
>>> Based on kvm-x86-mmu-6.3. The approach used here does not apply cleanly to
>>> <=v6.2. This would be needed in stable too, and I don't know about putting
>>> fixes tags.
>>
>> Cc: stable@xxxxxxxxxxxxxxx # 5.17.0
>>
>> should do)
>>
>>>
>>> Jeremi
>>>
>>> arch/x86/include/asm/kvm_host.h | 3 ++-
>>> arch/x86/kvm/mmu/mmu.c | 5 +++--
>>> arch/x86/kvm/svm/svm.c | 6 +++++-
>>> arch/x86/kvm/svm/svm_onhyperv.h | 10 ++++++++++
>>> arch/x86/kvm/vmx/vmx.c | 3 ++-
>>> 5 files changed, 22 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index 4d2bc08794e4..a0868ae3688d 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -2031,7 +2031,8 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
>>> void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
>>>
>>> void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>>> - int tdp_max_root_level, int tdp_huge_page_level);
>>> + int tdp_max_root_level, int tdp_huge_page_level,
>>> + bool enable_tdp_mmu);
>>>
>>> static inline u16 kvm_read_ldt(void)
>>> {
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index c91ee2927dd7..5c0e28a7a3bc 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -5787,14 +5787,15 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
>>> }
>>>
>>> void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>>> - int tdp_max_root_level, int tdp_huge_page_level)
>>> + int tdp_max_root_level, int tdp_huge_page_level,
>>> + bool enable_tdp_mmu)
>>> {
>>> tdp_enabled = enable_tdp;
>>> tdp_root_level = tdp_forced_root_level;
>>> max_tdp_level = tdp_max_root_level;
>>>
>>> #ifdef CONFIG_X86_64
>>> - tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
>>> + tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && enable_tdp_mmu;
>>> #endif
>>> /*
>>> * max_huge_page_level reflects KVM's MMU capabilities irrespective
>>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>>> index d13cf53e7390..070c3f7f8c9f 100644
>>> --- a/arch/x86/kvm/svm/svm.c
>>> +++ b/arch/x86/kvm/svm/svm.c
>>> @@ -4925,6 +4925,7 @@ static __init int svm_hardware_setup(void)
>>> struct page *iopm_pages;
>>> void *iopm_va;
>>> int r;
>>> + bool enable_tdp_mmu;
>>> unsigned int order = get_order(IOPM_SIZE);
>>>
>>> /*
>>> @@ -4991,9 +4992,12 @@ static __init int svm_hardware_setup(void)
>>> if (!boot_cpu_has(X86_FEATURE_NPT))
>>> npt_enabled = false;
>>>
>>> + enable_tdp_mmu = svm_hv_enable_tdp_mmu();
>>> +
>>> /* Force VM NPT level equal to the host's paging level */
>>> kvm_configure_mmu(npt_enabled, get_npt_level(),
>>> - get_npt_level(), PG_LEVEL_1G);
>>> + get_npt_level(), PG_LEVEL_1G,
>>> + enable_tdp_mmu);
>>> pr_info("Nested Paging %sabled\n", npt_enabled ? "en" : "dis");
>>>
>>> /* Setup shadow_me_value and shadow_me_mask */
>>> diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
>>> index 6981c1e9a809..aa49ac5d66bc 100644
>>> --- a/arch/x86/kvm/svm/svm_onhyperv.h
>>> +++ b/arch/x86/kvm/svm/svm_onhyperv.h
>>> @@ -30,6 +30,11 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>>> hve->hv_enlightenments_control.msr_bitmap = 1;
>>> }
>>>
>>> +static inline bool svm_hv_enable_tdp_mmu(void)
>>> +{
>>> + return !(npt_enabled && ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB);
>>> +}
>>> +
>>> static inline void svm_hv_hardware_setup(void)
>>> {
>>> if (npt_enabled &&
>>> @@ -84,6 +89,11 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>>> {
>>> }
>>>
>>> +static inline bool svm_hv_enable_tdp_mmu(void)
>>> +{
>>> + return true;
>>> +}
>>> +
>>> static inline void svm_hv_hardware_setup(void)
>>> {
>>> }
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index c788aa382611..4d3808755d39 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -8442,7 +8442,8 @@ static __init int hardware_setup(void)
>>> vmx_setup_me_spte_mask();
>>>
>>> kvm_configure_mmu(enable_ept, 0, vmx_get_max_tdp_level(),
>>> - ept_caps_to_lpage_level(vmx_capability.ept));
>>> + ept_caps_to_lpage_level(vmx_capability.ept),
>>> + true);
>>>
>>> /*
>>> * Only enable PML when hardware supports PML feature, and both EPT
>>
>

--
Vitaly