Re: [PATCH v3] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

From: Yan Zhao
Date: Mon Feb 05 2024 - 22:08:03 EST


On Mon, Feb 05, 2024 at 10:27:32AM -0800, Sean Christopherson wrote:
> On Mon, Feb 05, 2024, Yan Zhao wrote:
> > On Fri, Feb 02, 2024 at 04:35:18PM -0800, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 3c193b096b45..8ce9898914f1 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4415,6 +4415,25 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > >  	if (unlikely(!fault->slot))
> > >  		return kvm_handle_noslot_fault(vcpu, fault, access);
> > >  
> > > +	/*
> > > +	 * Pre-check for a relevant mmu_notifier invalidation event prior to
> > > +	 * acquiring mmu_lock. If there is an in-progress invalidation and the
> > > +	 * kernel allows preemption, the invalidation task may drop mmu_lock
> > > +	 * and yield in response to mmu_lock being contended, which is *very*
> > > +	 * counter-productive as this vCPU can't actually make forward progress
> > > +	 * until the invalidation completes. This "unsafe" check can get false
> > > +	 * negatives, i.e. KVM needs to re-check after acquiring mmu_lock.
> > > +	 *
> > > +	 * Do the pre-check even for non-preemptible kernels, i.e. even if KVM
> > > +	 * will never yield mmu_lock in response to contention, as this vCPU is
> > > +	 * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
> > > +	 * to detect retry guarantees the worst case latency for the vCPU.
> > > +	 */
> > > +	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
> > > +		kvm_release_pfn_clean(fault->pfn);
> > > +		return RET_PF_RETRY;
> > > +	}
> > > +
> > Could we also add this pre-check before faulting in the pfn? Like:
> > @@ -4404,6 +4404,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >
> >  	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> >  	smp_rmb();
> > +	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
> > +		return RET_PF_RETRY;
> >  
> >  	ret = __kvm_faultin_pfn(vcpu, fault);
> >  	if (ret != RET_PF_CONTINUE)
> >
> > Though the mmu_seq would always be equal in this check, it can avoid repeatedly
> > faulting in and releasing the pfn in vain during a long zap cycle.
>
> Just to be super clear, by "repeated faulting", you mean repeated faulting in the
> primary MMU, correct?
>
Yes. Faulting in the primary MMU.

BTW, will you also send the optimization from v1, as below?
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e340098d034..c7617991e290 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5725,11 +5725,13 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 	}
 
 	if (r == RET_PF_INVALID) {
-		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
-					  lower_32_bits(error_code), false,
-					  &emulation_type);
-		if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
-			return -EIO;
+		do {
+			r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
+						  lower_32_bits(error_code),
+						  false, &emulation_type);
+			if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
+				return -EIO;
+		} while (r == RET_PF_RETRY);
 	}
 
 	if (r < 0)

> Yeah, I agree, that makes sense. The only question is whether to check before
> and after, or only before. When abusing KSM, I see ~99.5% of all invalidations
> being detected before (21.5M out of just over 21.6M).
>
> I think it makes sense to also check after getting the pfn? The check is super
> cheap, especially when there isn't an invalidation event as the checks should all
> be quite predictable and cache hot.
I think checking both before and after is reasonable if the cost of the check is not
a concern.
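
For reference, a rough sketch of what the combined flow could look like with the
unsafe pre-check both before and after faulting in the pfn (stitched together from
the hunks quoted above; the surrounding code, local variables, and exact placement
inside kvm_faultin_pfn() are my assumptions, not the final patch):

	int ret;

	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
	smp_rmb();

	/*
	 * Check #1 (before faulting in the pfn): mmu_seq was just snapshotted,
	 * so this can only trip via the in-progress flag plus the range test,
	 * but that is exactly what avoids faulting the page into the primary
	 * MMU in vain during a long zap.  No pfn has been acquired yet, so
	 * there is nothing to release.
	 */
	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
		return RET_PF_RETRY;

	ret = __kvm_faultin_pfn(vcpu, fault);
	if (ret != RET_PF_CONTINUE)
		return ret;

	/*
	 * Check #2 (after faulting in the pfn): faulting the page into the
	 * primary MMU can sleep, so an invalidation may have started in the
	 * meantime; the pfn must be released before retrying.
	 */
	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
		kvm_release_pfn_clean(fault->pfn);
		return RET_PF_RETRY;
	}

	return RET_PF_CONTINUE;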


> > > +/*
> > > + * This lockless version of the range-based retry check *must* be paired with a
> > > + * call to the locked version after acquiring mmu_lock, i.e. this is safe to
> > > + * use only as a pre-check to avoid contending mmu_lock. This version *will*
> > > + * get false negatives and false positives.
> > > + */
> > > +static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
> > > +						   unsigned long mmu_seq,
> > > +						   gfn_t gfn)
> > > +{
> > > +	/*
> > > +	 * Use READ_ONCE() to ensure the in-progress flag and sequence counter
> > > +	 * are always read from memory, e.g. so that checking for retry in a
> > > +	 * loop won't result in an infinite retry loop. Don't force loads for
> > > +	 * start+end, as the key to avoiding infinite retry loops is observing
> > > +	 * the 1=>0 transition of in-progress, i.e. getting false negatives
> > > +	 * due to stale start+end values is acceptable.
> > > +	 */
> > > +	if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
> > > +	    gfn >= kvm->mmu_invalidate_range_start &&
> > > +	    gfn < kvm->mmu_invalidate_range_end)
> > > +		return true;
> > > +
> > Is an smp_rmb() before the check below better, given this retry helper is defined
> > in a header for all platforms?
>
> No, because the unsafe check very deliberately doesn't provide any ordering
> guarantees. The READ_ONCE() is purely to ensure forward progress. And trying to
> provide ordering for mmu_invalidate_in_progress is rather nonsensical. The smp_rmb()
> in mmu_invalidate_retry() ensures the caller will see a different mmu_invalidate_seq
> or an elevated mmu_invalidate_in_progress.
>
> For this code, the order in which mmu_invalidate_seq and mmu_invalidate_in_progress
> are checked truly doesn't matter as the range-based checks will get false
> negatives when performed outside of mmu_lock. And from a correctness perspective,
> there are zero constraints on when the checks are performed (beyond the existing
> constraints from the earlier smp_rmb() and acquiring mmu_lock).
>
> If an arch wants ordering guarantees, it can simply use mmu_invalidate_retry()
> without a gfn, which has the advantage of being safe outside of mmu_lock (the
> obvious disadvantage is that it's very imprecise).
Ok. It makes sense!
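
To put the "must be paired with the locked version" requirement in context, here is
a minimal sketch of the two call sites as I understand them; the lock flavor and the
is_page_fault_stale() wrapper below are assumptions about the surrounding x86 fault
path, not verbatim kernel code:

	/* Outside mmu_lock: the cheap, unordered pre-check; false negatives are fine. */
	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
		kvm_release_pfn_clean(fault->pfn);
		return RET_PF_RETRY;
	}

	write_lock(&vcpu->kvm->mmu_lock);

	/*
	 * Under mmu_lock: the locked, ordered retry check is still mandatory,
	 * since the unsafe variant may have missed an invalidation that raced
	 * with the pre-check.
	 */
	if (is_page_fault_stale(vcpu, fault)) {
		write_unlock(&vcpu->kvm->mmu_lock);
		kvm_release_pfn_clean(fault->pfn);
		return RET_PF_RETRY;
	}

	/* ... install the mapping, then drop mmu_lock and the pfn reference ... */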