Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

From: Sean Christopherson
Date: Thu Aug 24 2023 - 11:16:30 EST


On Thu, Aug 24, 2023, David Stevens wrote:
> On Wed, Jul 5, 2023 at 7:25 PM Yu Zhang <yu.c.zhang@xxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > @@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> > >
> > > out_unlock:
> > > read_unlock(&vcpu->kvm->mmu_lock);
> > > - kvm_release_pfn_clean(fault->pfn);
> >
> > Yet kvm_release_pfn() can still be triggered for the kvm_vcpu_maped gfns.
> > What if guest uses a non-referenced page(e.g., as a vmcs12)? Although I
> > believe this is not gonna happen in real world...
>
> kvm_vcpu_map still uses gfn_to_pfn, which eventually passes FOLL_GET
> to __kvm_follow_pfn. So if a guest tries to use a non-refcounted page
> like that, then kvm_vcpu_map will fail and the guest will probably
> crash. It won't trigger any bugs in the host, though.
>
> It is unfortunate that the guest will be able to use certain types of
> memory for some purposes but not for others. However, while it is
> theoretically fixable, it's an unreasonable amount of work for
> something that, as you say, nobody really cares about in practice [1].
>
> [1] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@xxxxxxxxxx/

There are use cases that care, which is why I suggested allow_unsafe_kmap.
Specifically, AWS manages their pool of guest memory in userspace and maps it all
via /dev/mem. Without that module param to let userspace opt-in, this series will
break such setups. It still arguably is a breaking change since it requires
userspace to opt-in, but allowing such behavior by default is simply not a viable
option, and I don't have much sympathy since so much of this mess has its origins
in commit e45adf665a53 ("KVM: Introduce a new guest mapping API").

The use cases that no one cares about (AFAIK) is allowing _untrusted_ userspace
to back guest RAM with arbitrary memory. In other words, I want KVM to allow
(by default) mapping device memory into the guest for things like vGPUs, without
having to do the massive and invasive overhaul needed to safely allow backing guest
RAM with completely arbitrary memory.