Re: [PATCH 1/2] KVM: Allow calling mmu_invalidate_retry_hva() without holding mmu_lock

From: Sean Christopherson
Date: Fri Sep 08 2023 - 13:37:17 EST


On Thu, Sep 07, 2023, Kai Huang wrote:
> On Thu, 2023-08-24 at 19:07 -0700, Sean Christopherson wrote:
> > Allow checking mmu_invalidate_retry_hva() without holding mmu_lock, even
> > though mmu_lock must be held to guarantee correctness, i.e. to avoid
> > false negatives. Dropping the requirement that mmu_lock be held will
> > allow pre-checking for retry before acquiring mmu_lock, e.g. to avoid
> > contending mmu_lock when the guest is accessing a range that is being
> > invalidated by the host.
> >
> > Contending mmu_lock can have severe negative side effects for x86's TDP
> > MMU when running on preemptible kernels, as KVM will yield from the
> > zapping task (holds mmu_lock for write) when there is lock contention,
> > and yielding after any SPTEs have been zapped requires a VM-scoped TLB
> > flush.
> >
> > Wrap mmu_invalidate_in_progress in READ_ONCE() to ensure that calling
> > mmu_invalidate_retry_hva() in a loop won't put KVM into an infinite loop,
> > e.g. due to caching the in-progress flag and never seeing it go to '0'.
> >
> > Force a load of mmu_invalidate_seq as well, even though it isn't strictly
> > necessary to avoid an infinite loop, as doing so improves the probability
> > that KVM will detect an invalidation that already completed before
> > acquiring mmu_lock and bailing anyways.
>
> Without the READ_ONCE() on mmu_invalidate_seq, with patch 2, and with
> mmu_invalidate_retry_hva() inlined, IIUC kvm_faultin_pfn() can look like
> this:
>
> fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq; <-- (1)
> smp_rmb();
>
> ...
> READ_ONCE(vcpu->kvm->mmu_invalidate_in_progress);
> ...
>
> if (vcpu->kvm->mmu_invalidate_seq != fault->mmu_seq) <-- (2)
> ...
>
> Perhaps stupid question :-) Will the compiler believe that both
> vcpu->kvm->mmu_invalidate_seq and fault->mmu_seq never change, and thus
> eliminate the code at (1) and (2)? Or are all the barriers in between enough
> to prevent the compiler from doing so?

Practically speaking, no, there's far too much going on in __kvm_faultin_pfn().

But, KVM _could_ do the freshness check before __kvm_faultin_pfn() since KVM
has the memslot and thus the host virtual address at that point. I highly doubt
we'll ever do that, but it's possible. At that point, there'd be no spinlocks
or other barriers to ensure the load+check wouldn't get elided. That's still
extremely theoretical though.
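To make that concrete, here's a stripped-down illustration of the pattern that
would be exposed (this is not KVM code; struct demo_kvm and prefault_is_stale()
are made up for the example):

#include <stdbool.h>

struct demo_kvm {
        unsigned long mmu_invalidate_seq;
        long mmu_invalidate_in_progress;
};

/*
 * Hypothetical lock-free pre-check: snapshot the sequence, do work the
 * compiler can see through entirely, then recheck.  Nothing here tells the
 * compiler that the field can change underneath it, so it is allowed to
 * reuse the value loaded at (1) for the comparison at (2) and fold the
 * recheck into "return false".
 */
static bool prefault_is_stale(struct demo_kvm *kvm)
{
        unsigned long mmu_seq = kvm->mmu_invalidate_seq;        /* (1) */

        /* ... purely local computation, no opaque calls or barriers ... */

        return kvm->mmu_invalidate_seq != mmu_seq;              /* (2) */
}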

(1) can't be eliminated because acquiring mmu_lock provides enough barriers to
prevent the compiler from omitting the load, i.e. the compiler can't omit the
comparison that is done inside the critical section. (2) can theoretically be
optimized away by the compiler (when called before acquiring mmu_lock), though
it's extremely unlikely since there's sooo much going on between the load and
the check.
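And for contrast, a sketch of why the check done under mmu_lock is safe (again
illustrative only; demo_lock() is a made-up stand-in for acquiring mmu_lock):
the compiler can't see into an out-of-line call, so it has to assume the call
may have modified *kvm and must redo the load for the comparison.

void demo_lock(void);   /* stand-in for acquiring mmu_lock */

/* Reuses struct demo_kvm from the snippet above. */
static bool check_under_lock(struct demo_kvm *kvm, unsigned long mmu_seq)
{
        demo_lock();    /* opaque call, acts as a compiler barrier here */

        /* This load can't be satisfied from a register cached before the call. */
        return kvm->mmu_invalidate_seq != mmu_seq;
}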

> > Note, adding READ_ONCE() isn't entirely free, e.g. on x86, the READ_ONCE()
> > may generate a load into a register instead of doing a direct comparison
> > (MOV+TEST+Jcc instead of CMP+Jcc), but practically speaking the added cost
> > is a few bytes of code and maaaaybe a cycle or three.

...

> > -        if (kvm->mmu_invalidate_seq != mmu_seq)
> > +
> > +        if (READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq)
> >                  return 1;
> >          return 0;
> >  }
>
> I am not sure how mmu_invalidate_retry_hva() can be called in a loop, so it
> looks like all of this is theoretical, but the extra cost should be nearly
> zero as you said.

It's not currently called in a loop, but it wouldn't be wrong for KVM to do
something like the below instead of fully re-entering the guest, in which case
mmu_invalidate_retry_hva() would be called in a loop, just a big loop :-)

And if KVM checked for freshness before resolving the PFN, it'd be possible for
the loop to read and check the sequence without any barriers that would
guarantee a reload (again, very, very theoretically).

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e340098d034..c7617991e290 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5725,11 +5725,13 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
         }
 
         if (r == RET_PF_INVALID) {
-                r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
-                                          lower_32_bits(error_code), false,
-                                          &emulation_type);
-                if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
-                        return -EIO;
+                do {
+                        r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
+                                                  lower_32_bits(error_code),
+                                                  false, &emulation_type);
+                        if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
+                                return -EIO;
+                } while (r == RET_PF_RETRY);
         }
 
         if (r < 0)
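
For completeness, here's a minimal illustration (again not kernel code;
DEMO_READ_ONCE() and demo_wait_for_invalidation() are made up) of why the
READ_ONCE() on mmu_invalidate_in_progress matters once the check does sit in a
loop with no barriers per iteration: a plain load can legally be hoisted out of
the loop, and then KVM would never see the flag go back to '0'.

/* Approximates the kernel's READ_ONCE() with a volatile access. */
#define DEMO_READ_ONCE(x)       (*(volatile typeof(x) *)&(x))

/* Reuses struct demo_kvm from the earlier snippet. */
static void demo_wait_for_invalidation(struct demo_kvm *kvm)
{
        /*
         * With a plain load the compiler may read the flag once, keep it in
         * a register, and spin forever.  The volatile access forces a fresh
         * load on every iteration.
         */
        while (DEMO_READ_ONCE(kvm->mmu_invalidate_in_progress))
                ;       /* a real caller would retry the fault, not just spin */
}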