Re: [PATCH 0/3] KVM: x86: replace kvm_vcpu_map usage in vmx

From: Sean Christopherson
Date: Tue Mar 14 2023 - 19:33:37 EST


On Fri, Jan 27, 2023, David Stevens wrote:
> From: David Stevens <stevensd@xxxxxxxxxxxx>
>
> This series replaces the usage of kvm_vcpu_map in vmx with
> gfn_to_pfn_cache. See [1] for details on why kvm_vcpu_map is broken.
>
> The presence of kvm_vcpu_map blocks another series I would like to
> try to merge [2]. Although I'm not familiar with the internals of vmx,
> I've gone ahead and taken a stab at this cleanup. I've done some manual
> testing with nested VMs, and KVM selftests pass, but thorough feedback
> would be appreciated. Once this cleanup is done, I'll take a look at
> removing kvm_vcpu_map from svm.

First off, thank you for writing this code and posting the series. Seeing the
actual impact on the nVMX code was super helpful. Second, this response is likely
going to be very critical, but the criticism is not directed at you, it's directed
at the existing mess in KVM. On to the review...

I went fairly deep into review before coming to the conclusion that adding generic
support for mapping unpinned pages outside of the MMU is likely a waste of time
and effort, and will saddle KVM with a pile of complexity that has no real usage
outside of KVM selftests.

There are bugs all over the nVMX code in this series. Some are due to subtleties
in nVMX that aren't super difficult to resolve, but many of them are because the
core gfn_to_pfn_cache code is half-baked. E.g. KVM_REQ_OUTSIDE_GUEST_MODE is
completely useless as kicking the guest does absolutely nothing to guarantee KVM
refreshes the cache before re-entering the guest. That's not too horrible to
solved, e.g. the gpc struct could be expanded to track the KVM request associated
with guest usage so that nVMX (and nSVM) could trigger KVM_REQ_GET_NESTED_STATE_PAGES.

But then nested_get_vmcs12_pages() becomes even more complex because only some of
the pages/pfns it manages are actually plugged into vmcs02. E.g. the vmcs12 MSR
bitmap is read only at VM-Enter, what gets stuffed into vmcs02 is a KVM-allocated
bitmap. Using kvm_vcpu_map() is purely an optimization+simplification that allows
KVM to use a common set of helpers for accessing MSR bitmaps, e.g. as opposed to
having to do a big pile of uaccesses or copy the entire bitmaps into kernel memory.
The PDPTRs, which aren't actually managed through a gpc cache, would also need to
be separated out.

More problems arise in the use of kvm_gpc_check(). In hindsight, not adding a
lockdep assertion that gpc->lock is held in kvm_gpc_check() is a glaring flaw.
vmx_get_nested_state_pages() and vmx_complete_nested_posted_interrupt() fail to
take the gpc->lock when checking the cache. Again, not super difficult to resolve,
but the complexity continues to amount, especially when trying to reason about what
will happen if an invalidation occurs and sets KVM_REQ_GET_NESTED_STATE_PAGES while
vmx_get_nested_state_pages() is refreshing a cache (and thus has dropped gpc->lock).

Even more subtly broken is nested.pi_desc, which is snapshot from
vmx->nested.pi_desc_gpc.khva, but is not protected by the gpc stuff. I honestly
don't even want to think about what vmx_get_nested_state_pages() would look like
if it were fully protected.

Lastly, the number of nasty, fatal bugs we had to fix in the KVM Xen code just to
get KVM usage of gfn_to_pfn_cache healthy doesn't exactly instill me with confidence
that stabilizing guest usage will be all rainbows and butterflies.

So, I propose that we scrap this series along with the entire concept of
KVM_GUEST_USES_PFN, and instead add an off-by-default module param (could be a
per-VM capability in the future if necessary) that controls whether or not
kvm_vcpu_map() is allowed to map non-refcounted memory.

AIUI, AWS cares about mapping arbitrary memory via kvm_vcpu_map(), but uses a
static setup and trusts userspace, at least to the extent where guarding against
userspace unmapping (or moving, etc.) memory that is mapped via kvm_vcpu_map() is
a non-goal.

And on the ChromeOS side, the use case doesn't require playing nice with
kvm_vcpu_map(), the goal is purely to allow mapping non-refcounted "struct page"
memory into the guest so that it can be accessed by a vGPU (or some oddball
pass-through GPU setup?).

Unless I'm missing something, that approach would even give us something that is
consumable for stable backports, e.g. I believe the diff at the bottom is all that
is needed to close the existing holes (feel free to suggest a less scary name).

Then the only thing David S. and company need to sort out for their GPU use case
is teaching KVM to do the right thing with non-refcounted struct page memory.
And I don't feel guilty for asking them to do that work because it's actually
germane to their use case, as opposed to fixing the nVMX/nSVM mess.

And if anyone comes along with a use for unmapping/migrating/swapping memory that
is plugged directly into vmcs02, they get to have the honor of adding support and
convincing everyone that it's worth maintaining.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d255964ec331..c7669807cf73 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -93,6 +93,9 @@ unsigned int halt_poll_ns_shrink;
module_param(halt_poll_ns_shrink, uint, 0644);
EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);

+static bool allow_unsafe_kmap;
+module_param(allow_unsafe_kmap);
+
/*
* Ordering of locks:
*
@@ -2838,7 +2841,7 @@ int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
page = pfn_to_page(pfn);
hva = kmap(page);
#ifdef CONFIG_HAS_IOMEM
- } else {
+ } else if (allow_unsafe_kmap) {
hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
#endif
}