Re: [PATCH 2/3] KVM: x86: Add support for VMware guest specific hypercalls
From: Sean Christopherson
Date: Mon Feb 03 2025 - 14:41:46 EST
On Mon, Feb 03, 2025, Doug Covelli wrote:
> On Mon, Feb 3, 2025 at 1:22 PM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> >
> > On Mon, Feb 3, 2025 at 5:35 PM Doug Covelli <doug.covelli@xxxxxxxxxxxx> wrote:
> > > OK. It seems like fully embracing the in-kernel APIC is the way to go
> > > especially considering it really simplifies using KVM's support for nested
> > > virtualization. Speaking of nested virtualization we have been working on
> > > adding support for that and would like to propose a couple of changes:
> > >
> > > - Add an option for L0 to handle backdoor accesses from CPL3 code running in L2.
> > > On a #GP nested_vmx_l0_wants_exit can check if this option is enabled and KVM
> > > can handle the #GP like it would if it had been from L1 (exit to userlevel iff
> > > it is a backdoor access otherwwise deliver the fault to L2). When combined with
> > > enable_vmware_backdoor this will allow L0 to optionally handle backdoor accesses
> > > from CPL3 code running in L2. This is needed for cases such as running VMware
> > > tools in a Windows VM with VBS enabled. For other cases such as running tools
> > > in a Windows VM in an ESX VM we still want L1 to handle the backdoor accesses
> > > from L2.
> >
> > I think this makes sense and could be an argument to KVM_ENABLE_CAP.
> >
> > > - Extend KVM_EXIT_MEMORY_FAULT for permission faults (e.g the guest attempting
> > > to write to a page that has been protected by userlevel calling mprotect). This
> > > is useful for cases where we want synchronous detection of guest writes such as
> > > lazy snapshots (dirty page tracking is no good for this case). Currently
> > > permission faults result in KVM_RUN returning EFAULT which we handle by
> > > interpreting the instruction as we do not know the guest physical address
> > > associated with the fault.
> >
> > Yes, this makes sense too, though you might want to look into
> > userfaultfd as well.
> >
> > We had something planned using attributes, but I don't see any issue
> > extending it to EFAULT. Maybe it would have to be yet another
> > KVM_ENABLE_CAP; considering that it would break your existing code,
> > there might be someone else in the wild doing it.
>
> It looks like KVM_EXIT_MEMORY_FAULT was implemented in such a way that it
> won't break existing code:
>
> Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in
> that it accompanies a return code of ‘-1’, not ‘0’! errno will always
> be set to EFAULT or EHWPOISON when KVM exits with
> KVM_EXIT_MEMORY_FAULT, userspace should assume kvm_run.exit_reason is
> stale/undefined for all other error numbers.
>
> That being said we could certainly make this opt-in if that is preferable.
-EFAULT isn't the problem, KVM not being able to return useful information in
all situations is the issue. Specifically, "guest" accesses that are emulated
by KVM are problematic, because the -EFAULT from e.g. __kvm_write_guest_page()
is disconnected from the code that actually kicks out to userspace. In that
case, userspace will get KVM_EXIT_MMIO, not -EFAULT. There are more problems
beyond KVM_EXIT_MMIO vs. -EFAULT, e.g. instructions that perform multiple memory
accesses, "failures" that are squashed and never propagated to userspace (PV
features tend to do this), page splits, etc.
In general, I don't expect most KVM access to guest memory to Just Work, as I
doubt KVM will behave as you want.
We spent a lot of time trying to sort out a viable approach in the context of the
USERFAULT_ON_MISSING series[1], and ultimately gave up (ignoring that we postponed
the entire series)[2], because we decided that fully solving KVM accesses would
require an absurd amount of effort and churn, and wasn't at all necessary for the
userfault use case.
What exactly needs to happen on "synchronous detection of guest writes"? One
idea (which may be horribly flawed as I have put *very* little thought into it)
would be to implement a module (or KVM extension) that utilizes KVM's "external"
write-tracking APIs to get the synchronous notifications (see
arch/x86/include/asm/kvm_page_track.h).
[1] https://lore.kernel.org/all/ZIn6VQSebTRN1jtX@xxxxxxxxxx
[2] https://lore.kernel.org/all/ZR88w9W62qsZDro-@xxxxxxxxxx