Re: [PATCH v19 059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases
From: Edgecombe, Rick P
Date: Mon Mar 18 2024 - 19:46:31 EST
On Mon, 2024-02-26 at 00:26 -0800, isaku.yamahata@xxxxxxxxx wrote:
> From: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
>
> TDX architecturally supports only the write-back (WB) memory type for
> private memory, so a (virtualized) memory type change doesn't make sense
> for private memory.  Also, page migration isn't supported for TDX yet.
> (TDX architecturally supports page migration; it's a KVM and kernel
> implementation issue.)
>
> Regarding memory type change (mtrr virtualization and lapic page mapping
> change), pages are zapped by kvm_zap_gfn_range().  On the next KVM page
> fault, the SPTE entry with a new memory type for the page is populated.
> Regarding page migration, pages are zapped by the mmu notifier.  On the
> next KVM page fault, the new migrated page is populated.  Don't zap
> private pages on unmapping for those two cases.
Is the migration case relevant to TDX?
>
> When deleting/moving a KVM memory slot, zap private pages; this typically
> happens when tearing down the VM.  Don't invalidate private page tables,
> i.e. zap only leaf SPTEs for a KVM mmu that has a shared bit mask.  The
> existing kvm_tdp_mmu_invalidate_all_roots() depends on role.invalid with
> the read lock of mmu_lock so that other vcpus can operate on the KVM mmu
> concurrently.  It marks the root page table invalid and zaps SPTEs of the
> root page tables.  The TDX module doesn't allow unlinking a protected root
> page table from the hardware and then allocating a new one for it, i.e.
> replacing a protected root page table.  Instead, zap only leaf SPTEs for a
> KVM mmu with a shared bit mask set.
I get the part about only zapping leafs and not the root and mid-level
PTEs. But why the MTRR, lapic page and migration part? Why should private
pages not be zapped in those cases? And why is migration a consideration
at all when it isn't supported?
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> ---
> arch/x86/kvm/mmu/mmu.c | 61 ++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++----
> arch/x86/kvm/mmu/tdp_mmu.h | 5 ++--
> 3 files changed, 92 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0d6d4506ec97..30c86e858ae4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6339,7 +6339,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
> */
> if (tdp_mmu_enabled)
> - kvm_tdp_mmu_invalidate_all_roots(kvm);
> + kvm_tdp_mmu_invalidate_all_roots(kvm, true);
>
> /*
> * Notify all vcpus to reload its shadow page table and flush TLB.
> @@ -6459,7 +6459,16 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>
> if (tdp_mmu_enabled)
> - flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush);
> + /*
> + * zap_private = false. Zap only shared pages.
> + *
> + * kvm_zap_gfn_range() is used when MTRR or PAT memory
> + * type was changed. Later on the next kvm page fault,
> + * populate it with updated spte entry.
> + * Because only WB is supported for private pages, don't
> + * care of private pages.
> + */
> + flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush, false);
>
> if (flush)
> kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
> @@ -6905,10 +6914,56 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
> kvm_mmu_zap_all(kvm);
> }
>
> +static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
What about kvm_mmu_zap_memslot_leafs() instead?
> +{
> + bool flush = false;
It doesn't need to be initialized if false is passed directly into
kvm_tdp_mmu_unmap_gfn_range(). That would make the code easier to
understand.
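i.e. something like the below (untested), so the reader doesn't have to
chase the initial value of flush:

		flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false);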
> +
> + write_lock(&kvm->mmu_lock);
> +
> + /*
> + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> + * case scenario we'll have unused shadow pages lying around until they
> + * are recycled due to age or when the VM is destroyed.
> + */
> + if (tdp_mmu_enabled) {
> + struct kvm_gfn_range range = {
> + .slot = slot,
> + .start = slot->base_gfn,
> + .end = slot->base_gfn + slot->npages,
> + .may_block = true,
> +
> + /*
> + * This handles both private gfn and shared gfn.
> + * All private page should be zapped on memslot deletion.
> + */
> + .only_private = true,
> + .only_shared = true,
only_private and only_shared are both true? Shouldn't they both be
false? (or just unset)
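I'd have expected something like the below (untested, and assuming that
leaving both flags false/unset means "no filter", i.e. zap both private
and shared mappings):

		struct kvm_gfn_range range = {
			.slot = slot,
			.start = slot->base_gfn,
			.end = slot->base_gfn + slot->npages,
			.may_block = true,
			/* only_private/only_shared left unset: zap both. */
		};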
> + };
> +
> + flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush);
> + } else {
> + /* TDX supports only TDP-MMU case. */
> + WARN_ON_ONCE(1);
How about a KVM_BUG_ON() instead? If somehow this is reached, we don't
want the caller thinking the pages are zapped, and then entering the
guest with mappings to pages that have gone elsewhere.
> + flush = true;
Why flush?
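Something like the below (untested) is what I had in mind. With the VM
bugged via KVM_BUG_ON() the guest can't be entered again, so I don't
think the flush matters either way:

	} else {
		/* TDX supports only the TDP MMU case. */
		KVM_BUG_ON(1, kvm);
	}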
> + }
> + if (flush)
> + kvm_flush_remote_tlbs(kvm);
> +
> + write_unlock(&kvm->mmu_lock);
> +}
> +
> void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot)
> {
> - kvm_mmu_zap_all_fast(kvm);
> + if (kvm_gfn_shared_mask(kvm))
There seems to be an attempt to abstract away the existence of Secure-EPT
in mmu.c that is not fully successful. In this case the code checks
kvm_gfn_shared_mask() to see if it needs to handle the zapping in the way
S-EPT specifically requires. It ends up being a little confusing because
the actual check is about whether there is a shared bit. It only works
because S-EPT is the only thing that has a kvm_gfn_shared_mask().
Doing something like (kvm->arch.vm_type == KVM_X86_TDX_VM) looks wrong,
but is more honest about what we are getting up to here. I'm not sure
though, what do you think?
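i.e. something like this (untested, not necessarily better, just more
direct about the actual reason from the commit message):

	/*
	 * The TDX module doesn't allow replacing a protected root page
	 * table, so only leaf SPTEs can be zapped for a TD.
	 */
	if (kvm->arch.vm_type == KVM_X86_TDX_VM)
		kvm_mmu_zap_memslot(kvm, slot);
	else
		kvm_mmu_zap_all_fast(kvm);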
> + /*
> + * Secure-EPT requires to release PTs from the leaf. The
> + * optimization to zap root PT first with child PT doesn't work.
> + */
> + kvm_mmu_zap_memslot(kvm, slot);
> + else
> + kvm_mmu_zap_all_fast(kvm);
> }
>
> void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index d47f0daf1b03..e7514a807134 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> * for zapping and thus puts the TDP MMU's reference to each root, i.e.
> * ultimately frees all roots.
> */
> - kvm_tdp_mmu_invalidate_all_roots(kvm);
> + kvm_tdp_mmu_invalidate_all_roots(kvm, false);
> kvm_tdp_mmu_zap_invalidated_roots(kvm);
>
> WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
> @@ -771,7 +771,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> * operation can cause a soft lockup.
> */
> static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> - gfn_t start, gfn_t end, bool can_yield, bool flush)
> + gfn_t start, gfn_t end, bool can_yield, bool flush,
> + bool zap_private)
> {
> struct tdp_iter iter;
>
> @@ -779,6 +780,10 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
>
> lockdep_assert_held_write(&kvm->mmu_lock);
>
> + WARN_ON_ONCE(zap_private && !is_private_sp(root));
All the callers pass zap_private as zap_private && is_private_sp(root).
What badness is it trying to uncover?
> + if (!zap_private && is_private_sp(root))
> + return false;
> +