Re: [PATCH v10 10/16] KVM: x86: Introduce KVM_GET_SHARED_PAGES_LIST ioctl

From: Ashish Kalra
Date: Fri Mar 19 2021 - 14:00:53 EST


On Thu, Mar 11, 2021 at 12:48:07PM -0800, Steve Rutherford wrote:
> On Thu, Mar 11, 2021 at 10:15 AM Ashish Kalra <ashish.kalra@xxxxxxx> wrote:
> >
> > On Wed, Mar 03, 2021 at 06:54:41PM +0000, Will Deacon wrote:
> > > [+Marc]
> > >
> > > On Tue, Mar 02, 2021 at 02:55:43PM +0000, Ashish Kalra wrote:
> > > > On Fri, Feb 26, 2021 at 09:44:41AM -0800, Sean Christopherson wrote:
> > > > > On Fri, Feb 26, 2021, Ashish Kalra wrote:
> > > > > > On Thu, Feb 25, 2021 at 02:59:27PM -0800, Steve Rutherford wrote:
> > > > > > > On Thu, Feb 25, 2021 at 12:20 PM Ashish Kalra <ashish.kalra@xxxxxxx> wrote:
> > > > > > > Thanks for grabbing the data!
> > > > > > >
> > > > > > > I am fine with both paths. Sean has stated an explicit desire for
> > > > > > > hypercall exiting, so I think that would be the current consensus.
> > > > >
> > > > > Yep, though it'd be good to get Paolo's input, too.
> > > > >
> > > > > > > If we want to do hypercall exiting, this should be in a follow-up
> > > > > > > series where we implement something more generic, e.g. a hypercall
> > > > > > > exiting bitmap or hypercall exit list. If we are taking the hypercall
> > > > > > > exit route, we can drop the kvm side of the hypercall.
> > > > >
> > > > > I don't think this is a good candidate for arbitrary hypercall interception. Or
> > > > > rather, I think hypercall interception should be an orthogonal implementation.
> > > > >
> > > > > The guest, including guest firmware, needs to be aware that the hypercall is
> > > > > supported, and the ABI needs to be well-defined. Relying on userspace VMMs to
> > > > > implement a common ABI is an unnecessary risk.
> > > > >
> > > > > We could make KVM's default behavior be a nop, i.e. have KVM enforce the ABI but
> > > > > require further VMM intervention. But, I just don't see the point, it would
> > > > > save only a few lines of code. It would also limit what KVM could do in the
> > > > > future, e.g. if KVM wanted to do its own bookkeeping _and_ exit to userspace,
> > > > > then mandatory interception would essentially make it impossible for KVM to do
> > > > > bookkeeping while still honoring the interception request.
> > > > >
> > > > > However, I do think it would make sense to have the userspace exit be a generic
> > > > > exit type. But hey, we already have the necessary ABI defined for that! It's
> > > > > just not used anywhere.
> > > > >
> > > > > /* KVM_EXIT_HYPERCALL */
> > > > > struct {
> > > > > __u64 nr;
> > > > > __u64 args[6];
> > > > > __u64 ret;
> > > > > __u32 longmode;
> > > > > __u32 pad;
> > > > > } hypercall;
> > > > >
> > > > >
> > > > > > > Userspace could also handle the MSR using MSR filters (would need to
> > > > > > > confirm that). Then userspace could also be in control of the cpuid bit.
> > > > >
> > > > > An MSR is not a great fit; it's x86 specific and limited to 64 bits of data.
> > > > > The data limitation could be fudged by shoving data into non-standard GPRs, but
> > > > > that will result in truly heinous guest code, and extensibility issues.
> > > > >

We may also need to pass-through the MSR to userspace, as it is a part of this
complete host (userspace/kernel), OVMF and guest kernel negotiation of
the SEV live migration feature.

Host (userspace/kernel) advertises it's support for SEV live migration
feature via the CPUID bits, which is queried by OVMF and which in turn
adds a new UEFI runtime variable to indicate support for SEV live
migration, which is later queried during guest kernel boot and
accordingly the guest does a wrmrsl() to custom MSR to complete SEV
live migration negotiation and enable it.

Now, the GET_SHARED_REGION_LIST ioctl returns error, until this MSR write
enables SEV live migration, hence, preventing userspace to start live
migration before the feature support has been negotiated and enabled on
all the three components - host, guest OVMF and kernel.

But, now with this ioctl not existing anymore, we will need to
pass-through the MSR to userspace too, for it to only initiate live
migration once the feature negotiation has been completed.

Thanks,
Ashish

> > > > > The data limitation is a moot point, because the x86-only thing is a deal
> > > > > breaker. arm64's pKVM work has a near-identical use case for a guest to share
> > > > > memory with a host. I can't think of a clever way to avoid having to support
> > > > > TDX's and SNP's hypervisor-agnostic variants, but we can at least not have
> > > > > multiple KVM variants.
> > > >
> > > > Looking at arm64's pKVM work, i see that it is a recently introduced RFC
> > > > patch-set and probably relevant to arm64 nVHE hypervisor
> > > > mode/implementation, and potentially makes sense as it adds guest
> > > > memory protection as both host and guest kernels are running on the same
> > > > privilege level ?
> > > >
> > > > Though i do see that the pKVM stuff adds two hypercalls, specifically :
> > > >
> > > > pkvm_create_mappings() ( I assume this is for setting shared memory
> > > > regions between host and guest) &
> > > > pkvm_create_private_mappings().
> > > >
> > > > And the use-cases are quite similar to memory protection architectues
> > > > use cases, for example, use with virtio devices, guest DMA I/O, etc.
> > >
> > > These hypercalls are both private to the host kernel communicating with
> > > its hypervisor counterpart, so I don't think they're particularly
> > > relevant here. As far as I can see, the more useful thing is to allow
> > > the guest to communicate back to the host (and the VMM) that it has opened
> > > up a memory window, perhaps for virtio rings or some other shared memory.
> > >
> > > We hacked this up as a prototype in the past:
> > >
> > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fandroid-kvm.googlesource.com%2Flinux%2F%2B%2Fd12a9e2c12a52cf7140d40cd9fa092dc8a85fac9%255E%2521%2F&amp;data=04%7C01%7Cashish.kalra%40amd.com%7C21099e2dd82c477005a008d8e4cf149e%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637510925361923243%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=JputIEFoae3HRO5jvvn%2FTpP8i%2FpMMfWSiZ8a2%2Bna5iQ%3D&amp;reserved=0
> > >
> > > but that's all arm64-specific and if we're solving the same problem as
> > > you, then let's avoid arch-specific stuff if possible. The way in which
> > > the guest discovers the interface will be arch-specific (we already have
> > > a mechanism for that and some hypercalls are already allocated by specs
> > > from Arm), but the interface back to the VMM and some (most?) of the host
> > > handling could be shared.
> > >
> >
> > I have started implementing a similar "hypercall to userspace"
> > functionality for these DMA_SHARE/DMA_UNSHARE type of interfaces
> > corresponding to SEV guest's add/remove shared regions on the x86 platform.
> >
> > This does not implement a generic hypercall exiting infrastructure,
> > mainly extends the KVM hypercall support to return back to userspace
> > specifically for add/remove shared region hypercalls and then re-uses
> > the complete userspace I/O callback functionality to resume the guest
> > after returning back from userspace handling of the hypercall.
> >
> > Looking fwd. to any comments/feedback/thoughts on the above.
> Others have mentioned a lack of appetite for generic hypercall
> intercepts, so this is the right approach.
>
> Thanks,
> Steve