Re: [PATCH 15/15] KVM: x86/mmu: Promote pages in-place when disabling dirty logging

From: Peter Xu
Date: Wed Nov 24 2021 - 23:21:11 EST


Hi, Ben,

On Mon, Nov 15, 2021 at 03:46:03PM -0800, Ben Gardon wrote:
> When disabling dirty logging, the TDP MMU currently zaps each leaf entry
> mapping memory in the relevant memslot. This is very slow. Doing the zaps
> under the mmu read lock requires a TLB flush for every zap and the
> zapping causes a storm of ETP/NPT violations.
>
> Instead of zapping, replace the split large pages with large page
> mappings directly. While this sort of operation has historically only
> been done in the vCPU page fault handler context, refactorings earlier
> in this series and the relative simplicity of the TDP MMU make it
> possible here as well.

Thanks for this patch, it looks very useful.

I've got a few comments below, but before that I've also got one off-topic
question too; it'll be great if you can help answer.

When I was looking into how the old code recovers the huge pages I found that
we'll leave the full-zero pgtable page there until the next page fault, then I
_think_ it'll be released only until the __handle_changed_spte() when we're
dropping the old spte (handle_removed_tdp_mmu_page).

As comment above handle_removed_tdp_mmu_page() showed, at this point IIUC
current thread should have exclusive ownership of this orphaned and abandoned
pgtable page, then why in handle_removed_tdp_mmu_page() we still need all the
atomic operations and REMOVED_SPTE tricks to protect from concurrent access?
Since that's cmpxchg-ed out of the old pgtable, what can be accessing it
besides the current thread?

>
> Running the dirty_log_perf_test on an Intel Skylake with 96 vCPUs and 1G
> of memory per vCPU, this reduces the time required to disable dirty
> logging from over 45 seconds to just over 1 second. It also avoids
> provoking page faults, improving vCPU performance while disabling
> dirty logging.
>
>
> Signed-off-by: Ben Gardon <bgardon@xxxxxxxxxx>
> ---
> arch/x86/kvm/mmu/mmu.c | 2 +-
> arch/x86/kvm/mmu/mmu_internal.h | 4 ++
> arch/x86/kvm/mmu/tdp_mmu.c | 69 ++++++++++++++++++++++++++++++++-
> 3 files changed, 72 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ef7a84422463..add724aa9e8c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4449,7 +4449,7 @@ static inline bool boot_cpu_is_amd(void)
> * the direct page table on host, use as much mmu features as
> * possible, however, kvm currently does not do execution-protection.
> */
> -static void
> +void
> build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> int shadow_root_level)
> {
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 6563cce9c438..84d439432acf 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -161,4 +161,8 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
> +void
> +build_tdp_shadow_zero_bits_mask(struct rsvd_bits_validate *shadow_zero_check,
> + int shadow_root_level);
> +
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 43c7834b4f0a..b15c8cd11cf9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1361,6 +1361,66 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
> }
>
> +static void try_promote_lpage(struct kvm *kvm,
> + const struct kvm_memory_slot *slot,
> + struct tdp_iter *iter)
> +{
> + struct kvm_mmu_page *sp = sptep_to_sp(iter->sptep);
> + struct rsvd_bits_validate shadow_zero_check;
> + /*
> + * Since the TDP MMU doesn't manage nested PTs, there's no need to
> + * write protect for a nested VM when PML is in use.
> + */
> + bool ad_need_write_protect = false;

Shall we just pass in "false" in make_spte() and just move the comment there?

> + bool map_writable;
> + kvm_pfn_t pfn;
> + u64 new_spte;
> + u64 mt_mask;
> +
> + /*
> + * If addresses are being invalidated, don't do in-place promotion to
> + * avoid accidentally mapping an invalidated address.
> + */
> + if (unlikely(kvm->mmu_notifier_count))
> + return;
> +
> + pfn = __gfn_to_pfn_memslot(slot, iter->gfn, true, NULL, true,
> + &map_writable, NULL);

Should we better check pfn validity and bail out otherwise? E.g. for atomic I
think we can also get KVM_PFN_ERR_FAULT when fast-gup failed somehow.

> +
> + /*
> + * Can't reconstitute an lpage if the consituent pages can't be
> + * mapped higher.
> + */
> + if (iter->level > kvm_mmu_max_mapping_level(kvm, slot, iter->gfn,
> + pfn, PG_LEVEL_NUM))
> + return;
> +
> + build_tdp_shadow_zero_bits_mask(&shadow_zero_check, iter->root_level);
> +
> + /*
> + * In some cases, a vCPU pointer is required to get the MT mask,
> + * however in most cases it can be generated without one. If a
> + * vCPU pointer is needed kvm_x86_try_get_mt_mask will fail.
> + * In that case, bail on in-place promotion.
> + */
> + if (unlikely(!static_call(kvm_x86_try_get_mt_mask)(kvm, iter->gfn,
> + kvm_is_mmio_pfn(pfn),
> + &mt_mask)))
> + return;
> +
> + make_spte(kvm, sp, slot, ACC_ALL, iter->gfn, pfn, 0, false, true,
> + map_writable, ad_need_write_protect, mt_mask,
> + &shadow_zero_check, &new_spte);
> +
> + tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
> +
> + /*
> + * Re-read the SPTE to avoid recursing into one of the removed child
> + * page tables.
> + */
> + iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));

Is this redundant since it seems the iterator logic handles this already, I'm
reading try_step_down() here:

/*
* Reread the SPTE before stepping down to avoid traversing into page
* tables that are no longer linked from this entry.
*/
iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));

The rest looks good to me, thanks.

> +}
> +
> /*
> * Clear leaf entries which could be replaced by large mappings, for
> * GFNs within the slot.
> @@ -1381,9 +1441,14 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
> if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> continue;
>
> - if (!is_shadow_present_pte(iter.old_spte) ||
> - !is_last_spte(iter.old_spte, iter.level))
> + if (!is_shadow_present_pte(iter.old_spte))
> + continue;
> +
> + /* Try to promote the constitutent pages to an lpage. */
> + if (!is_last_spte(iter.old_spte, iter.level)) {
> + try_promote_lpage(kvm, slot, &iter);
> continue;
> + }
>
> pfn = spte_to_pfn(iter.old_spte);
> if (kvm_is_reserved_pfn(pfn) ||
> --
> 2.34.0.rc1.387.gb447b232ab-goog
>

--
Peter Xu