Re: [PATCH 19/28] KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery

From: Sean Christopherson
Date: Mon Nov 22 2021 - 20:16:13 EST


On Mon, Nov 22, 2021, Ben Gardon wrote:
> On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > @@ -755,6 +759,26 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> > return false;
> > }
> >
> > +bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> > +{
> > + u64 old_spte;
> > +
> > + rcu_read_lock();
> > +
> > + old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
> > + if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte))) {
> > + rcu_read_unlock();
> > + return false;
> > + }
> > +
> > + __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
> > + sp->gfn, sp->role.level + 1, true, true);
> > +
> > + rcu_read_unlock();
> > +
> > + return true;
> > +}
> > +
>
> Ooooh this makes me really nervous. There are a lot of gotchas to
> modifying SPTEs in a new context without traversing the paging
> structure like this. For example, we could modify an SPTE under an
> invalidated root here. I don't think that would be a problem since
> we're just clearing it, but it makes the code more fragile.

Heh, it better not be a problem, because there are plently of flows in the TDP MMU
that can modify SPTEs under an invalidated root, e.g. fast_page_fault(),
tdp_mmu_zap_leafs(), kvm_age_gfn(), kvm_test_age_gfn(), etc... And before the
patch that introduced is_page_fault_stale(), kvm_tdp_mmu_map() was even installing
SPTEs into an invalid root! Anything that takes a reference to a root and yields
(or never takes mmu_lock) can potentially modify a SPTE under an invalid root.

Checking the paging structures for this flow wouldn't change anything. Invalidating
a root doesn't immediately zap SPTEs, it just marks the root invalid. The other
subtle gotcha is that kvm_reload_remote_mmus() doesn't actually gaurantee all vCPUs
will have dropped the invalid root or performed a TLB flush when mmu_lock is dropped,
those guarantees are only with respect to re-entering the guest!

All of the above is no small part of why I don't want to walk the page tables:
it's completely misleading as walking the page tables doesn't actually provide any
protection, it's holding RCU that guarantees KVM doesn't write memory it doesn't own.

> Another approach to this would be to do in-place promotion / in-place
> splitting once the patch sets David and I just sent out are merged. That
> would avoid causing extra page faults here to bring in the page after this
> zap, but it probably wouldn't be safe if we did it under an invalidated root.

I agree that in-place promotion would be better, but if we do that, I think a logical
intermediate step would be to stop zapping unrelated roots and entries. If there's
a bug that is exposed/introduced by not zapping other stuff, I would much rather it
show up when KVM stops zapping other stuff, not when KVM stops zapping other stuff
_and_ promotes in place. Ditto for if in-place promotion introduces a bug.

> I'd rather avoid this extra complexity and just tolerate the worse
> performance on the iTLB multi hit mitigation at this point since new
> CPUs seem to be moving past that vulnerability.

IMO, it reduces complexity, especially when looking at the series as a whole, which
I fully realize you haven't yet done :-) Setting aside the complexities of each
chunk of code, what I find complex with the current TDP MMU zapping code is that
there are no precise rules for what needs to be done in each situation. I'm not
criticizing how we got to this point, I absolutely think that hitting everything
with a big hammer to get the initial version stable was the right thing to do.

But a side effect of the big hammer approach is that it makes reasoning about things
more difficult, e.g. "when is it safe to modify a SPTE versus when is it safe to insert
a SPTE into the paging structures?" or "what needs to be zapped when the mmu_notifier
unmaps a range?".

And I also really want to avoid another snafu like the memslots with passthrough
GPUs bug, where using a big hammer (zap all) when a smaller hammer (zap SPTEs for
the memslot) _should_ work allows bugs to creep in unnoticed because they're hidden
by overzealous zapping.

> If you think this is worth the complexity, it'd be nice to do a little
> benchmarking to make sure it's giving us a substantial improvement.

Performance really isn't a motivating factor. Per the changelog, the motivation
is mostly to allow later patches to simplify zap_gfn_range() by having it zap only
leaf SPTEs, and now that I've typed it up, also all of the above :-)