Re: [PATCH 2/2] KVM: x86/mmu: Ensure TLBs are flushed when yielding during NX zapping

From: Sean Christopherson
Date: Thu Mar 25 2021 - 15:16:29 EST


On Tue, Mar 23, 2021, Ben Gardon wrote:
> On Tue, Mar 23, 2021 at 11:58 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Tue, Mar 23, 2021, Ben Gardon wrote:
> > > On Mon, Mar 22, 2021 at 5:15 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Mar 22, 2021, Ben Gardon wrote:
> > > > > It could be fixed by forbidding kvm_tdp_mmu_zap_gfn_range from
> > > > > yielding. Since we should only need to zap one SPTE, the yield should
> > > > > not be needed within the kvm_tdp_mmu_zap_gfn_range call. To ensure
> > > > > that only one SPTE is zapped we would have to specify the root though.
> > > > > Otherwise we could end up zapping all the entries for the same GFN
> > > > > range under an unrelated root.
> > > >
> > > > Hmm, I originally did exactly that, but changed my mind because this zaps far
> > > > more than 1 SPTE. This is zapping a SP that could be huge, but is not, which
> > > > means it's guaranteed to have a non-zero number of child SPTEs. The worst case
> > > > scenario is that SP is a PUD (potential 1gb page) and the leafs are 4k SPTEs.
> > >
> > > It's true that there are potentially 512^2 child sptes, but the code
> > > to clear those after the single PUD spte is cleared doesn't yield
> > > anyway. If the TDP MMU is only operating with one root (as we would
> > > expect in most cases), there should only be one chance for it to
> > > yield.
> >
> > Ah, right, I was thinking all the iterative flows yielded. Disallowing
> > kvm_tdp_mmu_zap_gfn_range() from yielding in this case does seem like the best
> > fix. Any objection to me sending v2 with that?
>
> That sounds good to me.

Ewww. This analysis isn't 100% accurate. It's actually impossible for
zap_gfn_range() to yield in this case. Even though it may walk multiple roots
and levels, "yielded_gfn == next_last_level_gfn" will hold true until the iter
attempts to step sideways. When the iter attempts to step sideways, it will
always do so at the level that matches the zapping level, and so will always
step past "end". Thus, tdp_root_for_each_pte() will break without ever
yielding.

That being said, I'm still going to send a patch to explicitly prevent this path
from yielding. Relying on the above is waaaay too subtle and fragile.

> > > I've considered how we could allow the recursive changed spte handlers
> > > to yield, but it gets complicated quite fast because the caller needs
> > > to know if it yielded and reset the TDP iterator to the root, and
> > > there are some cases (mmu notifiers + vCPU path) where yielding is not
> > > desirable.
> >
> > Urgh, yeah, seems like we'd quickly end up with a mess resembling the legacy MMU
> > iterators.
> >
> > > >
> > > > But, I didn't consider the interplay between invalid_list and the TDP MMU
> > > > yielding. Hrm.