Re: [PATCH v3 4/5] KVM: pfncache: Move invalidation to invalidate_range_end hook

From: Sean Christopherson
Date: Tue Oct 15 2024 - 20:17:25 EST


On Wed, Aug 21, 2024, David Woodhouse wrote:
> From: David Woodhouse <dwmw@xxxxxxxxxxxx>
>
> By performing the invalidation from the 'end' hook, the process is a lot
> cleaner and simpler because it is guaranteed that ->needs_invalidation
> will be cleared after hva_to_pfn() has been called, so the only thing
> that hva_to_pfn_retry() needs to do is *wait* for there to be no pending
> invalidations.
>
> Another false positive on the range-based checks is thus removed as well.
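
(For context, a sleeping wait of the shape the changelog describes might look
roughly like the below. This is an untested sketch; gpc_invalidations_pending()
and the mn_invalidate_lock/gpc_invalidate_wq names are stand-ins for whatever
the series actually uses.)

	static void gpc_wait_for_invalidations(struct gfn_to_pfn_cache *gpc)
	{
		DEFINE_WAIT(wait);

		spin_lock(&gpc->kvm->mn_invalidate_lock);
		while (gpc_invalidations_pending(gpc)) {
			prepare_to_wait(&gpc->kvm->gpc_invalidate_wq, &wait,
					TASK_UNINTERRUPTIBLE);
			spin_unlock(&gpc->kvm->mn_invalidate_lock);

			/* Give up the CPU until an invalidation completes. */
			schedule();

			spin_lock(&gpc->kvm->mn_invalidate_lock);
		}
		finish_wait(&gpc->kvm->gpc_invalidate_wq, &wait);
		spin_unlock(&gpc->kvm->mn_invalidate_lock);
	}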

...


> @@ -261,6 +239,14 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
> goto out_error;
> }
>
> + /*
> + * Avoid populating a GPC with a user address which is already
> + * being invalidated, if the invalidate_range_start() notifier
> + * has already been called. The actual invalidation happens in
> + * the invalidate_range_end() callback, so wait until there are
> + * no active invalidations (for the relevant HVA).
> + */
> + gpc_wait_for_invalidations(gpc);

I'm not convinced scheduling out the vCPU is actually a good idea. At first
blush, it sounds good, e.g. why consume CPU cycles when forward progress is
blocked?

But scheduling out the vCPU will likely negatively impact latency, and KVM can't
predict when the invalidation will complete. E.g. the refresh() could have come
along at the start of the invalidation, but it also could have arrived at the
tail end.

And if the vCPU is the only runnable task on the CPU, and the system is under enough
load to trigger an invalidation, then scheduling out won't provide any benefit
because odds are good the processor won't be able to get into a deep enough sleep
state to allow other logical CPUs to hit higher turbo bins.

The wait+schedule() logic for the memslots update is purely about being deliberately
_unfair_ to avoid a deadlock (the original idea was to simply use a r/w semaphore
to block memslot updates, but lock fairness led to a possible deadlock).

If we want to avoid busy waiting, then we should probably do something along the lines
of halt-polling, e.g. busy wait for a bit and then schedule() out. But even that
would require tuning, and I highly doubt that there will be enough colliding
invalidations and retries to build sufficient state to allow KVM to make a "good"
decision.

If you (or anyone else) have numbers to show that blocking until the invalidation
goes away provides meaningful value in most cases, then by all means. But without
numbers, I don't think we should go this route as it's not a clear win.

> write_lock_irq(&gpc->lock);
>
> @@ -269,6 +255,13 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
> * attempting to refresh.
> */
> WARN_ON_ONCE(gpc->valid);
> +
> + /*
> + * Since gfn_to_pfn_cache_invalidate() is called from the
> + * kvm_mmu_notifier_invalidate_range_end() callback, it can
> + * invalidate the GPC the moment after hva_to_pfn() returned
> + * a valid PFN. If that happens, retry.
> + */
> } while (!gpc->needs_invalidation);
>
> gpc->valid = true;
> --
> 2.44.0
>