Re: [PATCH v3] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing

From: Sean Christopherson
Date: Mon Feb 05 2024 - 13:35:33 EST


On Mon, Feb 05, 2024, Yan Zhao wrote:
> On Fri, Feb 02, 2024 at 04:35:18PM -0800, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 3c193b096b45..8ce9898914f1 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4415,6 +4415,25 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > if (unlikely(!fault->slot))
> > return kvm_handle_noslot_fault(vcpu, fault, access);
> >
> > + /*
> > + * Pre-check for a relevant mmu_notifier invalidation event prior to
> > + * acquiring mmu_lock. If there is an in-progress invalidation and the
> > + * kernel allows preemption, the invalidation task may drop mmu_lock
> > + * and yield in response to mmu_lock being contended, which is *very*
> > + * counter-productive as this vCPU can't actually make forward progress
> > + * until the invalidation completes. This "unsafe" check can get false
> > + * negatives, i.e. KVM needs to re-check after acquiring mmu_lock.
> > + *
> > + * Do the pre-check even for non-preemtible kernels, i.e. even if KVM
> > + * will never yield mmu_lock in response to contention, as this vCPU is
> > + * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
> > + * to detect retry guarantees the worst case latency for the vCPU.
> > + */
> > + if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
> > + kvm_release_pfn_clean(fault->pfn);
> > + return RET_PF_RETRY;
> > + }
> > +
> Could we also add this pre-check before fault in the pfn? like
> @@ -4404,6 +4404,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>
> fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> smp_rmb();
> + if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
> + return RET_PF_CONTINUE;
>
> ret = __kvm_faultin_pfn(vcpu, fault);
> if (ret != RET_PF_CONTINUE)
>
> Though the mmu_seq would be always equal in the check, it can avoid repeated faulting
> and release pfn for vain during a long zap cycle.

Just to be super claer, by "repeated faulting", you mean repeated faulting in the
primary MMU, correct?

Yeah, I agree, that makes sense. The only question is whether to check before
and after, or only before. When abusing KSM, I see ~99.5% of all invalidations
being detected before (21.5M out of just over 21.6M).

I think it makes sense to also check after getting the pfn? The check is super
cheap, especially when there isn't an invalidation event as the checks should all
be quite predictable and cache hot.

> > +/*
> > + * This lockless version of the range-based retry check *must* be paired with a
> > + * call to the locked version after acquiring mmu_lock, i.e. this is safe to
> > + * use only as a pre-check to avoid contending mmu_lock. This version *will*
> > + * get false negatives and false positives.
> > + */
> > +static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
> > + unsigned long mmu_seq,
> > + gfn_t gfn)
> > +{
> > + /*
> > + * Use READ_ONCE() to ensure the in-progress flag and sequence counter
> > + * are always read from memory, e.g. so that checking for retry in a
> > + * loop won't result in an infinite retry loop. Don't force loads for
> > + * start+end, as the key to avoiding infinite retry loops is observing
> > + * the 1=>0 transition of in-progress, i.e. getting false negatives
> > + * due to stale start+end values is acceptable.
> > + */
> > + if (unlikely(READ_ONCE(kvm->mmu_invalidate_in_progress)) &&
> > + gfn >= kvm->mmu_invalidate_range_start &&
> > + gfn < kvm->mmu_invalidate_range_end)
> > + return true;
> > +
> Is a smp_rmb() before below check better, given this retry is defined in a header
> for all platforms?

No, because the unsafe check very deliberately doesn't provide any ordering
guarantees. The READ_ONCE() is purely to ensure forward progress. And trying to
provide ordering for mmu_invalidate_in_progress is rather nonsensical. The smp_rmb()
in mmu_invalidate_retry() ensures the caller will see a different mmu_invalidate_seq
or an elevated mmu_invalidate_in_progress.

For this code, the order in which mmu_invalidate_seq and mmu_invalidate_in_progress
are checked truly doesn't matter as the range-based checks are will get false
negatives when performed outside of mmu_lock. And from a correctness perspective,
there are zero constraints on when the checks are performed (beyond the existing
constraints from the earlier smp_rmb() and acquiring mmu_lock).

If an arch wants ordering guarantees, it can simply use mmu_invalidate_retry()
without a gfn, which has the advantage of being safe outside of mmu_lock (the
obvious disadvantage is that it's very imprecise).