Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping

From: Ben Gardon
Date: Tue Mar 23 2021 - 12:27:25 EST


On Mon, Mar 22, 2021 at 5:15 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Mar 22, 2021, Ben Gardon wrote:
> > On Fri, Mar 19, 2021 at 4:20 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > @@ -5960,19 +5963,21 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
> > >  				      lpage_disallowed_link);
> > >  		WARN_ON_ONCE(!sp->lpage_disallowed);
> > >  		if (is_tdp_mmu_page(sp)) {
> > > -			kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn,
> > > -				sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level));
> > > +			gfn_end = sp->gfn + KVM_PAGES_PER_HPAGE(sp->role.level);
> > > +			flush = kvm_tdp_mmu_zap_gfn_range(kvm, sp->gfn, gfn_end,
> > > +							  flush || !list_empty(&invalid_list));
> > >  		} else {
> > >  			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> > >  			WARN_ON_ONCE(sp->lpage_disallowed);
> > >  		}
> > >
> > >  		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> > > -			kvm_mmu_commit_zap_page(kvm, &invalid_list);
> > > +			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
> >
> > This pattern of waiting until a yield is needed or lock contention is
> > detected has always been a little suspect to me because
> > kvm_mmu_commit_zap_page does work proportional to the work done before
> > the yield was needed. That seems like more work than we should like to
> > be doing at that point.
> >
> > The yield in kvm_tdp_mmu_zap_gfn_range makes that phenomenon even
> > worse. Because we can satisfy the need to yield without clearing out
> > the invalid list, we can potentially queue many more pages which will
> > then all need to have their zaps committed at once. This is an
> > admittedly contrived case which could only be hit in a high load
> > nested scenario.
> >
> > It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from
> > yielding. Since we should only need to zap one SPTE, the yield should
> > not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure
> > that only one SPTE is zapped we would have to specify the root though.
> > Otherwise we could end up zapping all the entries for the same GFN
> > range under an unrelated root.
>
> Hmm, I originally did exactly that, but changed my mind because this zaps far
> more than 1 SPTE. This is zapping a SP that could be huge, but is not, which
> > means it's guaranteed to have a non-zero number of child SPTEs. The worst-case
> > scenario is that the SP is a PUD (potential 1GB page) and the leaves are 4k SPTEs.

It's true that there are potentially 512^2 child SPTEs, but the code
that clears those out after the single PUD SPTE is cleared doesn't
yield anyway. If the TDP MMU is only operating with one root (as we
would expect in most cases), there should only be one chance for it to
yield.
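
To make that concrete, the shape I'm describing is roughly the
following. This is a hand-wavy sketch, not the actual zap_gfn_range()
code; the iterator macro and the *_sketch() helpers are stand-ins:

static bool zap_sp_range_sketch(struct kvm *kvm, struct kvm_mmu_page *root,
				gfn_t start, gfn_t end, bool can_yield)
{
	struct tdp_iter iter;
	bool flush = false;

	/* For the NX recovery case the range covers one non-leaf SPTE. */
	for_each_tdp_pte_sketch(iter, root, start, end) {
		/* The only reschedule point: hit at most once per root. */
		if (can_yield && tdp_mmu_iter_cond_resched_sketch(kvm, &iter))
			continue;

		/*
		 * Clearing the one non-leaf SPTE recursively tears down
		 * all (up to 512^2) child SPTEs without ever yielding.
		 */
		tdp_mmu_set_spte_sketch(kvm, &iter, 0);
		flush = true;
	}

	return flush;
}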

I've considered how we could allow the recursive changed-SPTE handlers
to yield, but it gets complicated quite fast because the caller needs
to know whether the handler yielded so it can reset the TDP iterator to
the root, and there are some cases (MMU notifiers + the vCPU path)
where yielding is not desirable.
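
Very roughly, and purely to illustrate the plumbing (none of these
helpers exist in this form), the walk would need something like:

	/*
	 * Hypothetical sketch: the recursive handler reports that it
	 * dropped mmu_lock, and the walker restarts from the root
	 * because the paging structure may have changed under it.
	 */
	if (handle_changed_spte_sketch(kvm, &iter, can_yield) == SPTE_HANDLER_YIELDED) {
		tdp_iter_restart_sketch(&iter);
		continue;
	}

and every caller that can't tolerate dropping the lock (the MMU
notifiers, the vCPU fault path) would have to pass can_yield=false.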

>
> But, I didn't consider the interplay between invalid_list and the TDP MMU
> yielding. Hrm.