Re: [PATCH 15/28] KVM: x86/mmu: Take TDP MMU roots off list when invalidating all roots

From: Sean Christopherson
Date: Mon Nov 22 2021 - 18:08:56 EST


On Mon, Nov 22, 2021, Ben Gardon wrote:
> On Fri, Nov 19, 2021 at 8:51 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > Take TDP MMU roots off the list of roots when they're invalidated instead
> > of walking later on to find the roots that were just invalidated. In
> > addition to making the flow more straightforward, this allows warning
> > if something attempts to elevate the refcount of an invalid root, which
> > should be unreachable (no longer on the list so can't be reached by MMU
> > notifier, and vCPUs must reload a new root before installing new SPTE).
> >
> > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
>
> There are a bunch of awesome little cleanups and unrelated fixes
> included in this commit that could be factored out.
>
> I'm skeptical of immediately moving the invalidated roots into another
> list as that seems like it has a lot of potential for introducing
> weird races.

I disagree, the entire premise of fast invalidate is that there can't be races,
i.e. mmu_lock must be held for write. IMO, it's actually the opposite, as the only
reason leaving roots on the per-VM list doesn't have weird races is because slots_lock
is held. If slots_lock weren't required to do a fast zap, which is feasible for the
TDP MMU since it doesn't rely on the memslots generation, then it would be possible
for multiple calls to kvm_tdp_mmu_zap_invalidated_roots() to run in parallel. And in
that case, leaving roots on the per-VM list would lead to a single instance of a
"fast zap" zapping roots it didn't invalidate. That wouldn't be problematic per se,
but I don't like not having a clear "owner" of the invalidated root.
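
To make the "owner" point concrete, here is a minimal userspace sketch, not the
actual patch. The toy_* names and the singly linked list are hypothetical
stand-ins for struct kvm_mmu_page and tdp_mmu_roots, and the caller is assumed
to hold the equivalent of mmu_lock for write. Each invalidation detaches the
roots it marks invalid, so a later invalidation can only ever see roots that
were added afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kvm_mmu_page; not the kernel's layout. */
struct toy_root {
	struct toy_root *next;
	bool invalid;
	int id;
};

/* Hypothetical stand-in for the per-VM tdp_mmu_roots list. */
struct toy_vm {
	struct toy_root *roots;
};

/* Add a freshly allocated (valid) root to the per-VM list. */
static void toy_add_root(struct toy_vm *vm, struct toy_root *root)
{
	assert(root != NULL);
	root->invalid = false;
	root->next = vm->roots;
	vm->roots = root;
}

/*
 * Mark every root invalid and detach the whole list in one step.  The
 * returned list is private to the caller, i.e. the caller "owns" exactly
 * the roots it invalidated.  Assumes the caller holds the equivalent of
 * mmu_lock for write, so nothing else walks the list concurrently.
 */
static struct toy_root *toy_invalidate_all_roots(struct toy_vm *vm)
{
	struct toy_root *owned = vm->roots;
	struct toy_root *root;

	for (root = owned; root; root = root->next)
		root->invalid = true;

	vm->roots = NULL;
	return owned;
}

/* Count the roots on a list; used only to inspect the result. */
static int toy_count(const struct toy_root *root)
{
	int n = 0;

	for (; root; root = root->next)
		n++;
	return n;
}
```

With two roots added before the first invalidation and one added after it, the
first call returns a list of two roots and the second call returns only the
newer root, so neither caller ends up zapping a root it didn't invalidate.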

> I'm not sure it actually solves a problem either. Part of
> the motive from the commit description "this allows warning if
> something attempts to elevate the refcount of an invalid root" can be
> achieved already without moving the roots into a separate list.

Hmm, true in the sense that kvm_tdp_mmu_get_root() could be converted to a WARN,
but that would require tdp_mmu_next_root() to manually skip invalid roots.
kvm_tdp_mmu_get_vcpu_root_hpa() is naturally safe because page_role_for_level()
will never set the invalid flag.

I don't care too much about adding a manual check in tdp_mmu_next_root(). What I don't
like is that a WARN in kvm_tdp_mmu_get_root() wouldn't establish the invariant
that invalidated roots are unreachable; it would simply force callers to check
role.invalid.
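
For illustration only, a rough userspace sketch of that alternative, where
invalid roots stay on the list. The sketch_* names are hypothetical and the
refcounting is simplified to a plain int. The "get" helper refuses invalid
roots (the spot where a WARN would go), which means the walker has to skip
them by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a root that stays on the list when invalid. */
struct sketch_root {
	struct sketch_root *next;
	bool invalid;
	int refcount;
};

/*
 * Take a reference, refusing invalid roots.  In the alternative being
 * discussed, the "return false" is where a WARN would fire.
 */
static bool sketch_get_root(struct sketch_root *root)
{
	if (root->invalid)
		return false;

	root->refcount++;
	return true;
}

/*
 * Advance to the next root, dropping the reference on @prev.  Because
 * invalid roots are still reachable, the walker must check the flag
 * itself before calling sketch_get_root().
 */
static struct sketch_root *sketch_next_root(struct sketch_root *head,
					    struct sketch_root *prev)
{
	struct sketch_root *root = prev ? prev->next : head;

	if (prev)
		prev->refcount--;

	/* The manual check: skip anything that has been invalidated. */
	while (root && root->invalid)
		root = root->next;

	if (root && !sketch_get_root(root))
		return NULL;

	return root;
}
```

The flag check lives in the caller of sketch_get_root(), not in the data
structure, which is the objection: reachability of invalid roots is still
possible, it is just filtered at each call site.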

> Maybe this would seem more straightforward with some of the little
> cleanups factored out, but this feels more complicated to me.
> > @@ -124,6 +137,27 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> > {
> > struct kvm_mmu_page *next_root;
> >
> > + lockdep_assert_held(&kvm->mmu_lock);
> > +
> > + /*
> > + * Restart the walk if the previous root was invalidated, which can
> > + * happen if the caller drops mmu_lock when yielding. Restarting the
> > + * walke is necessary because invalidating a root also removes it from
>
> Nit: *walk
>
> > + * tdp_mmu_roots. Restarting is safe and correct because invalidating
> > + * a root is done if and only if _all_ roots are invalidated, i.e. any
> > + * root on tdp_mmu_roots was added _after_ the invalidation event.
> > + */
> > + if (prev_root && prev_root->role.invalid) {
> > + kvm_tdp_mmu_put_root(kvm, prev_root, shared);
> > + prev_root = NULL;
> > + }
> > +
> > + /*
> > + * Finding the next root must be done under RCU read lock. Although
> > + * @prev_root itself cannot be removed from tdp_mmu_roots because this
> > + * task holds a reference, its next and prev pointers can be modified
> > + * when freeing a different root. Ditto for tdp_mmu_roots itself.
> > + */
>
> I'm not sure this is correct with the rest of the changes in this
> patch. The new version of invalidate_roots removes roots from the list
> immediately, even if they have a non-zero ref-count.

Roots don't have to be invalidated to be removed, e.g. if the last reference is
put due to kvm_mmu_reset_context(). Or did I misunderstand?

> > rcu_read_lock();
> >
> > if (prev_root)
> > @@ -230,10 +264,13 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
> > refcount_set(&root->tdp_mmu_root_count, 1);
> >
> > - spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > - list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
> > - spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > -
> > + /*
> > + * Because mmu_lock must be held for write to ensure that KVM doesn't
> > + * create multiple roots for a given role, this does not need to use
> > + * an RCU-friendly variant as readers of tdp_mmu_roots must also hold
> > + * mmu_lock in some capacity.
> > + */
>
> I doubt we're doing it now, but in principle we could allocate new
> roots with mmu_lock in read + tdp_mmu_pages_lock. That might be better
> than depending on the write lock.

We're not; this function does lockdep_assert_held_write(&kvm->mmu_lock) a few
lines above. I don't have a preference between using mmu_lock.read+tdp_mmu_pages_lock
versus mmu_lock.write, but I do care that the current code doesn't incorrectly imply
that it's possible for something else to be walking the roots while this runs.

Either way, this should definitely be a separate patch; I'm pretty sure I just lost
track of it.

> > + list_add(&root->link, &kvm->arch.tdp_mmu_roots);
> > out:
> > return __pa(root->spt);
> > }