Re: [PATCH v10 047/108] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases

From: Huang, Kai
Date: Wed Dec 14 2022 - 06:18:28 EST


On Sat, 2022-10-29 at 23:22 -0700, isaku.yamahata@xxxxxxxxx wrote:
> From: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
>
> TDX architecturally supports only the write-back (WB) memory type for
> private memory, so a (virtualized) memory type change doesn't make sense
> for private memory. Also, page migration isn't supported for TDX yet.
> (TDX architecturally supports page migration; it's a KVM and kernel
> implementation issue.)
>
> Regarding memory type change (MTRR virtualization and LAPIC page mapping
> change), pages are zapped by kvm_zap_gfn_range(). On the next KVM page
> fault, the SPTE entry with the new memory type for the page is populated.
> Regarding page migration, pages are zapped by the mmu notifier. On the next
> KVM page fault, the newly migrated page is populated. Don't zap private
> pages on unmapping for those two cases.
>
> When deleting/moving a KVM memory slot, zap private pages; this typically
> happens when tearing down the VM. Don't invalidate private page tables,
> i.e. zap only leaf SPTEs for a KVM mmu that has a shared bit mask. The
> existing kvm_tdp_mmu_invalidate_all_roots() depends on role.invalid with
> the read-lock of mmu_lock so that other vCPUs can operate on the KVM mmu
> concurrently. It marks the root page tables invalid and zaps the SPTEs of
> the root page tables. The TDX module doesn't allow unlinking a protected
> root page table from the hardware and then allocating a new one for it,
> i.e. replacing a protected root page table. Instead, zap only leaf SPTEs
> for the KVM mmu with a shared bit mask set.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> ---
> arch/x86/kvm/mmu/mmu.c | 85 ++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/mmu/tdp_mmu.c | 24 ++++++++---
> arch/x86/kvm/mmu/tdp_mmu.h | 5 ++-
> 3 files changed, 103 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index faf69774c7ce..0237e143299c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1577,8 +1577,38 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> if (kvm_memslots_have_rmaps(kvm))
> flush = kvm_handle_gfn_range(kvm, range, kvm_zap_rmap);
>
> - if (is_tdp_mmu_enabled(kvm))
> - flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush);
> + if (is_tdp_mmu_enabled(kvm)) {
> + bool zap_private;
> +
> + if (kvm_slot_can_be_private(range->slot)) {
> + if (range->flags & KVM_GFN_RANGE_FLAGS_RESTRICTED_MEM)
> + /*
> + * For private slot, the callback is triggered
> + * via falloc. Mode can be allocation or punch
^
fallocate(), please?

> + * hole. Because the private-shared conversion
> + * is done via
> + * KVM_MEMORY_ENCRYPT_REG/UNREG_REGION, we can
> + * ignore the request from restrictedmem.
> + */
> + return flush;

Sorry why "private-shared conversion is done via KVM_MEMORY_ENCRYPT_REG" results
in "we can ignore the requres from restrictedmem"?

If we punch a hole, the pages are de-allocated, correct?

> + else if (range->flags & KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR) {
> + if (range->attr == KVM_MEM_ATTR_SHARED)
> + zap_private = true;
> + else {
> + WARN_ON_ONCE(range->attr != KVM_MEM_ATTR_PRIVATE);
> + zap_private = false;
> + }
> + } else
> + /*
> + * kvm_unmap_gfn_range() is called via mmu
> + * notifier. For now page migration for private
> + * page isn't supported yet, don't zap private
> + * pages.
> + */
> + zap_private = false;

Page migration is not the only reason that KVM will receive the MMU notifier --
just say something like "for now all private pages are pinned during the VM's
lifetime".


> + }
> + flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush, zap_private);
> + }
>
> return flush;
> }
> @@ -6066,11 +6096,48 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
> }
>
> +static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> +{
> + bool flush = false;
> +
> + write_lock(&kvm->mmu_lock);
> +
> + /*
> + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> + * case scenario we'll have unused shadow pages lying around until they
> + * are recycled due to age or when the VM is destroyed.
> + */
> + if (is_tdp_mmu_enabled(kvm)) {
> + struct kvm_gfn_range range = {
> + .slot = slot,
> + .start = slot->base_gfn,
> + .end = slot->base_gfn + slot->npages,
> + .may_block = false,
> + };
> +
> + /*
> + * this handles both private gfn and shared gfn.
> + * All private page should be zapped on memslot deletion.
> + */
> + flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush, true);
> + } else {
> + flush = slot_handle_level(kvm, slot, __kvm_zap_rmap, PG_LEVEL_4K,
> + KVM_MAX_HUGEPAGE_LEVEL, true);
> + }
> + if (flush)
> + kvm_flush_remote_tlbs(kvm);
> +
> + write_unlock(&kvm->mmu_lock);
> +}
> +
> static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
> struct kvm_memory_slot *slot,
> struct kvm_page_track_notifier_node *node)
> {
> - kvm_mmu_zap_all_fast(kvm);
> + if (kvm_gfn_shared_mask(kvm))
> + kvm_mmu_zap_memslot(kvm, slot);
> + else
> + kvm_mmu_zap_all_fast(kvm);
> }

A comment would be nice here.
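
E.g. something along these lines (only a sketch based on the changelog, feel
free to reword):

	/*
	 * The TDX module doesn't allow replacing a protected root page
	 * table, so a TD's private mappings cannot be zapped and rebuilt
	 * the way kvm_mmu_zap_all_fast() does it. Zap only the leaf SPTEs
	 * of the memslot instead.
	 */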

>
> int kvm_mmu_init_vm(struct kvm *kvm)
> @@ -6173,8 +6240,18 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>
> if (is_tdp_mmu_enabled(kvm)) {
> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> + /*
> + * zap_private = true. Zap both private/shared pages.
> + *
> + * kvm_zap_gfn_range() is used when PAT memory type was

Is it PAT or MTRR, or both (thus just memory type)?

> + * changed. Later on the next kvm page fault, populate
> + * it with updated spte entry.
> + * Because only WB is supported for private pages, don't
> + * care of private pages.
> + */

Then why bother zapping private pages? If I read correctly, the changelog says
"don't zap private"?

> flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
> - gfn_end, true, flush);
> + gfn_end, true, flush,
> + true);
> }
>
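
If the answer is that private pages indeed shouldn't be zapped here, a minimal
alternative (hypothetical, simply flipping the new zap_private argument this
patch adds) would be:

	flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
				      gfn_end, true, flush,
				      false /* zap_private */);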

Btw, as you mentioned in the changelog, private memory always has the WB memory
type, thus its memory type cannot be virtualized. Would it be better to modify
update_mtrr() to just return early if the gfn range is purely private?
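
Something like the below early return (a rough sketch only; kvm_range_is_private()
is a hypothetical helper that would need to check the memory attributes of the
whole range):

	/*
	 * Private memory is always mapped WB and its memory type cannot be
	 * virtualized, so there is nothing to zap if the whole range is
	 * private.
	 */
	if (kvm_range_is_private(vcpu->kvm, gpa_to_gfn(start),
				 gpa_to_gfn(end)))
		return;

	kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));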

IMHO the handling of MTRR/PAT virtualization for TDX guests deserves dedicated
patch(es) that put it all together so it's easier to review. Right now the
relevant parts are spread across multiple independent patches (MSR handling,
vt_get_mt_mask(), etc.).