Re: [PATCH 2/2] KVM: mmu/x86: optimize zapping by retaining non-leaf SPTEs and avoid rcu stall

From: David Matlack
Date: Mon Nov 29 2021 - 19:50:21 EST


On Tue, Nov 30, 2021 at 12:48:24AM +0000, David Matlack wrote:
> On Wed, Nov 24, 2021 at 09:44:21PM +0000, Mingwei Zhang wrote:
> > TDP MMU SPTE zapping process currently uses two levels of iterations. The
> > first level iteration happens at the for loop within the zap_gfn_range()
> > with the purpose of calibrating the accurate range for zapping. The second
> > level itreration start at tdp_mmu_set_spte{,_atomic}() that tears down the
>
> iteration
>
> > whole paging structures (leaf and non-leaf SPTEs) within the range. The
> > former iteration is yield safe, while the second one is not.
>
> I know what you mean but I'd suggest being more specific than "yield
> safe". For example:
>
> Unlike the outer loop, the recursive zapping done under
> tdp_mmu_set_spte{,_atomic} does not yield. Since zapping is done with
> a pre-order traversal, zapping sufficiently large ranges can lead to
> RCU stall warnings.
>
> I'd also clarify here that the TDP MMU iterator uses a pre-order
> traversal which causes us KVM to end up doing the maximum amount of
> zapping under tdp_mmu_set_spte{,_atomic} and not the outer for loop.

(Ah sorry for the redundant suggestion. I wrote this paragraph and then
reworded my suggested wording to mention the pre-order traversal.)

>
> >
> > In many cases, zapping SPTE process could be optimized since the non-leaf
> > SPTEs could most likely be retained for the next allocation. On the other
> > hand, for large scale SPTE zapping scenarios, we may end up zapping too
> > many SPTEs and use excessive CPU time that causes the RCU stall warning.
> >
> > The follow selftest reproduces the warning:
> >
> > (env: kvm.tdp_mmu=Y)
> > ./dirty_log_perf_test -v 64 -b 8G
> >
> > Optimize the zapping process by skipping all SPTEs above a certain level in
> > the first iteration. This allows us to control the granularity of the
> > actual zapping and invoke tdp_mmu_iter_cond_resched() on time. In addition,
> > we would retain some of the non-leaf SPTEs to accelerate next allocation.
> >
> > For the selection of the `certain level`, we choose the PG_LEVEL_1G because
> > it is currently the largest page size supported and it natually fits the
> > scenario of splitting large pages.
> >
> > For `zap_all` case (usually) at VM teardown time, we use a two-phase
> > mechanism: the 1st phase zaps all SPTEs at PG_LEVEL_1G level and 2nd phase
> > zaps everything else. This is achieved by the helper function
> > __zap_gfn_range().
> >
> > Cc: Sean Christopherson <seanjc@xxxxxxxxxx>
> > Cc: Ben Gardon <bgardon@xxxxxxxxxx>
> > Cc: David Matlack <dmatlack@xxxxxxxxxx>
> >
> > Signed-off-by: Mingwei Zhang <mizhang@xxxxxxxxxx>
> > ---
> > arch/x86/kvm/mmu/tdp_mmu.c | 57 ++++++++++++++++++++++++++------------
> > 1 file changed, 40 insertions(+), 17 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 89d16bb104de..3fadc51c004a 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -697,24 +697,16 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> > * account for the possibility that other threads are modifying the paging
> > * structures concurrently. If shared is false, this thread should hold the
> > * MMU lock in write mode.
> > + *
> > + * If zap_all is true, eliminate all the paging structures that contains the
> > + * SPTEs.
> > */
> > -static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > - gfn_t start, gfn_t end, bool can_yield, bool flush,
> > - bool shared)
> > +static bool __zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > + gfn_t start, gfn_t end, bool can_yield, bool flush,
> > + bool shared, bool zap_all)
> > {
> > - gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> > - bool zap_all = (start == 0 && end >= max_gfn_host);
> > struct tdp_iter iter;
> >
> > - /*
> > - * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> > - * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> > - * and so KVM will never install a SPTE for such addresses.
> > - */
> > - end = min(end, max_gfn_host);
> > -
> > - kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> > -
> > rcu_read_lock();
> >
> > tdp_root_for_each_pte(iter, root, start, end) {
> > @@ -725,17 +717,24 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > continue;
> > }
> >
> > - if (!is_shadow_present_pte(iter.old_spte))
> > + /*
> > + * In zap_all case, ignore the checking of present since we have
> > + * to zap everything.
> > + */
> > + if (!zap_all && !is_shadow_present_pte(iter.old_spte))
> > continue;
>
> I don't believe there's any reason to attempt to zap a non-present spte,
> even in the zap_all case. In any case, this change deserves its own
> patch and a commit message that describes why the old logic is incorrect
> and how this fixes it.
>
> >
> > /*
> > * If this is a non-last-level SPTE that covers a larger range
> > * than should be zapped, continue, and zap the mappings at a
> > - * lower level, except when zapping all SPTEs.
> > + * lower level. Actual zapping started at proper granularity
> > + * that is not so large as to cause a soft lockup when handling
> > + * the changed pte (which does not yield).
> > */
> > if (!zap_all &&
> > (iter.gfn < start ||
> > - iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end) &&
> > + iter.gfn + KVM_PAGES_PER_HPAGE(iter.level) > end ||
> > + iter.level > PG_LEVEL_1G) &&
> > !is_last_spte(iter.old_spte, iter.level))
> > continue;
>
> This if statement is getting a bit long. I'd suggest breaking out the
> level check and also using KVM_MAX_HUGEPAGE_LEVEL.
>
> e.g.
>
> /*
> * If not doing zap_all, only zap up to the huge page level to
> * avoid doing too much work in the recursive tdp_mmu_set_spte*
> * call below, since it does not yield.
> *
> * This will potentially leave behind some childless page tables
> * but that's ok because ...
> */
> if (!zap_all && iter.level > KVM_MAX_HUGEPAGE_LEVEL)
> continue;
>
> And on that note, what is the reasoning for why it's ok to leave behind
> childless page tables? I assume it's because most of the time we'll use
> that page table again in the future, and at worst we leave the page
> table allocated until the VM is cleaned up?
>
> >
> > @@ -756,6 +755,30 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > return flush;
> > }
> >
> > +static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> > + gfn_t start, gfn_t end, bool can_yield, bool flush,
> > + bool shared)
> > +{
> > + gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> > + bool zap_all = (start == 0 && end >= max_gfn_host);
> > +
> > + /*
> > + * Bound the walk at host.MAXPHYADDR, guest accesses beyond that will
> > + * hit a #PF(RSVD) and never get to an EPT Violation/Misconfig / #NPF,
> > + * and so KVM will never install a SPTE for such addresses.
> > + */
> > + end = min(end, max_gfn_host);
> > +
> > + kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> > +
> > + flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush, shared,
> > + false);
> > + if (zap_all)
> > + flush = __zap_gfn_range(kvm, root, start, end, can_yield, flush,
> > + shared, true);
> > + return flush;
> > +}
> > +
> > /*
> > * Tears down the mappings for the range of gfns, [start, end), and frees the
> > * non-root pages mapping GFNs strictly within that range. Returns true if
> > --
> > 2.34.0.rc2.393.gf8c9666880-goog
> >