Re: [PATCH v3 2/5] KVM: pfncache: Implement range-based invalidation check for hva_to_pfn_retry()

From: Sean Christopherson
Date: Tue Oct 15 2024 - 20:05:05 EST


On Wed, Aug 21, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@xxxxxxxxxxxx>
>
> The existing retry loop in hva_to_pfn_retry() is extremely pessimistic.
> If there are any concurrent invalidations running, it's effectively just
> a complex busy wait loop because its local mmu_notifier_retry_cache()
> function will always return true.
>
> Since multiple invalidations can be running in parallel, this can result
> in a situation where hva_to_pfn_retry() just backs off and keep retrying
> for ever, not making any progress.
>
> Solve this by being a bit more selective about when to retry. In
> addition to the ->needs_invalidation flag added in a previous commit,
> check before calling hva_to_pfn() if there is any pending invalidation
> (i.e. between invalidate_range_start() and invalidate_range_end()) which
> affects the uHVA of the GPC being updated. This catches the case where
> hva_to_pfn() would return a soon-to-be-stale mapping of a range for which
> the invalidate_range_start() hook had already been called before the uHVA
> was even set in the GPC and the ->needs_invalidation flag set.
>
> An invalidation which *starts* after the lock is dropped in the loop is
> fine because it will clear the ->needs_revalidation flag and also trigger
> a retry.
>
> This is slightly more complex than it needs to be; moving the
> invalidation to occur in the invalidate_range_end() hook will make life
> simpler, in a subsequent commit.

I'm pretty sure that's still more complex than it needs to be. And even if the
gpc implementation isn't any more/less complex, diverging from the page fault
side of things makes KVM overall more complex.

Maybe I'm just not seeing some functionality that needs_invalidation provides,
but I still don't understand why it's necessary.

Omitting context to keep this short, I think the below is what we can end up at;
see attached (very lightly tested) patches for the full idea. So long as either
the invalidation side marks the gpc invalid, or the refresh side is guaranteed
to check for an invalidation, I don't see why needs_invalidation is required.

If the cache is invalid, or the uhva is changing, then the event doesn't need to
do anything because refresh() will hva_to_pfn_retry() to re-resolve the pfn. So
at that point, KVM just needs to ensure either hva_to_pfn_retry() observes the
in-progress range, or the event sees valid==true.

Note, the attached patches include the "wait" idea, but I'm not convinced that's
actually a good idea. I'll respond more to your last patch.

gfn_to_pfn_cache_invalidate_start():

...

/*
* Ensure that either each cache sees the to-be-invalidated range and
* retries if necessary, or that this task sees the cache's valid flag
* and invalidates the cache before completing the mmu_notifier event.
* Note, gpc->uhva must be set before gpc->valid, and if gpc->uhva is
* modified the cache must be re-validated. Pairs with the smp_mb() in
* hva_to_pfn_retry().
*/
smp_mb__before_atomic();

spin_lock(&kvm->gpc_lock);
list_for_each_entry(gpc, &kvm->gpc_list, list) {
if (!gpc->valid || gpc->uhva < start || gpc->uhva >= end)
continue;

write_lock_irq(&gpc->lock);

/*
* Verify the cache still needs to be invalidated after
* acquiring gpc->lock, to avoid an unnecessary invalidation
* in the unlikely scenario the cache became valid with a
* different userspace virtual address.
*/
if (gpc->valid && gpc->uhva >= start && gpc->uhva < end)
gpc->valid = false;
write_unlock_irq(&gpc->lock);
}
spin_unlock(&kvm->gpc_lock);

hva_to_pfn_retry():

do {
/*
* Invalidate the cache prior to dropping gpc->lock, the
* gpa=>uhva assets have already been updated and so a check()
* from a different task may not fail the gpa/uhva/generation
* checks, i.e. must observe valid==false.
*/
gpc->valid = false;

write_unlock_irq(&gpc->lock);

...

write_lock_irq(&gpc->lock);

/*
* Other tasks must wait for _this_ refresh to complete before
* attempting to refresh.
*/
WARN_ON_ONCE(gpc->valid);
gpc->valid = true;

/*
* Ensure valid is visible before checking if retry is needed.
* Pairs with the smp_mb__before_atomic() in
* gfn_to_pfn_cache_invalidate().
*/
smp_mb();
} while (gpc_invalidate_retry_hva(gpc, mmu_seq));

> @@ -193,6 +174,29 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
>
> write_unlock_irq(&gpc->lock);
>
> + /*
> + * Invalidation occurs from the invalidate_range_start() hook,
> + * which could already have happened before __kvm_gpc_refresh()
> + * (or the previous turn around this loop) took gpc->lock().
> + * If so, and if the corresponding invalidate_range_end() hook
> + * hasn't happened yet, hva_to_pfn() could return a mapping
> + * which is about to be stale and which should not be used. So
> + * check if there are any currently-running invalidations which
> + * affect the uHVA of this GPC, and retry if there are. Any
> + * invalidation which starts after gpc->needs_invalidation is
> + * set is fine, because it will clear that flag and trigger a
> + * retry. And any invalidation which *completes* by having its
> + * invalidate_range_end() hook called immediately prior to this
> + * check is also fine, because the page tables are guaranteed
> + * to have been changted already, so hva_to_pfn() won't return

s/changted/changed

> + * a stale mapping in that case anyway.
> + */
> + while (gpc_invalidations_pending(gpc)) {
> + cond_resched();
> + write_lock_irq(&gpc->lock);
> + continue;

Heh, our testcase obviously isn't as robust as we hope it is. Or maybe only the
"wait" version was tested. The "continue" here will continue the
while(gpc_invalidations_pending) loop, not the outer while loop. That will cause
deadlock due to trying acquiring gpc->lock over and over.

> + }
> +
> /*
> * If the previous iteration "failed" due to an mmu_notifier
> * event, release the pfn and unmap the kernel virtual address
> @@ -233,6 +237,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
> goto out_error;
> }
>
> +

Spurious whitespace.

> write_lock_irq(&gpc->lock);
>
> /*