Re: [RFC PATCH v2 00/51] 1G page support for guest_memfd

From: Vishal Annapurve
Date: Tue Jul 08 2025 - 15:29:21 EST


On Tue, Jul 8, 2025 at 11:03 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Tue, Jul 08, 2025, Rick P Edgecombe wrote:
> > On Tue, 2025-07-08 at 10:16 -0700, Vishal Annapurve wrote:
> > > > Right, I read that. I still don't see why pKVM needs to do normal
> > > > private/shared
> > > > conversion for data provisioning. Vs a dedicated operation/flag to make it a
> > > > special case.
> > >
> > > It's dictated by pKVM usecases, memory contents need to be preserved
> > > for every conversion not just for initial payload population.
> >
> > We are weighing pros/cons between:
> > - Unifying this uABI across all gmemfd VM types
> > - Userspace for one VM type passing a flag for its special non-shared use case
> >
> > I don't see how passing a flag or not is dictated by pKVM use case.
>
> Yep. Baking the behavior of a single usecase into the kernel's ABI is rarely a
> good idea. Just because pKVM's current usecases always want contents to be
> preserved doesn't mean that pKVM will never change.
>
> As a general rule, KVM should push policy to userspace whenever possible.
>
> > P.S. This doesn't really impact TDX I think. Except that TDX development needs
> > to work in the code without bumping anything. So just wishing to work in code
> > with less conditionals.
> >
> > >
> > > >
> > > > I'm trying to suggest there could be a benefit to making all gmem VM types
> > > > behave the same. If conversions are always content preserving for pKVM, why
> > > > can't userspace always use the operation that says preserve content? Vs
> > > > changing the behavior of the common operations?
> > >
> > > I don't see a benefit of userspace passing a flag that's kind of
> > > default for the VM type (assuming pKVM will use a special VM type).
> >
> > The benefit is that we don't need to have special VM default behavior for
> > gmemfd. Think about if some day (very hypothetical and made up) we want to add a
> > mode for TDX that adds new private data to a running guest (with special accept
> > on the guest side or something). Then we might want to add a flag to override
> > the default destructive behavior. Then maybe pKVM wants to add a "don't
> > preserve" operation and it adds a second flag to not destroy. Now gmemfd has
> > lots of VM specific flags. The point of this example is to show how unified uABI
> > can be helpful.
>
> Yep again. Pivoting on the VM type would be completely inflexible. If pKVM gains
> a usecase that wants to zero memory on conversions, we're hosed. If SNP or TDX
> gains the ability to preserve data on conversions, we're hosed.
>
> The VM type may restrict what is possible, but (a) that should be abstracted,
> e.g. by defining the allowed flags during guest_memfd creation, and (b) the
> capabilities of the guest_memfd instance need to be communicated to userspace.

Ok, I concur with this: It's beneficial to keep a unified ABI that
allows guest_memfd to make runtime decisions without relying on VM
type as far as possible.

A few points that seem important here:
1) Userspace can and should be able to dictate only whether memory
   contents need to be preserved on shared to private conversion.
   -> For SNP/TDX VMs:
      * The only usecase for preserving contents is initial memory
        population, which can be achieved by:
        - Userspace converting the ranges to shared, populating the
          contents, converting them back to private and then calling
          the existing SNP/TDX specific ABI functions.
      * For runtime conversions, guest_memfd can't ensure memory
        contents are preserved during shared to private conversions,
        as the architectures don't support that behavior.
      * So IMO, this "preserve" flag doesn't make sense for SNP/TDX
        VMs. Even if we add this flag, today guest_memfd should
        effectively mark it unsupported based on the backing
        architecture support.
2) For pKVM, if userspace wants to specify a "preserve" flag, then this
   flag can be allowed based on the known capabilities of the backing
   architecture.

So this topic is still orthogonal to "zeroing on private to shared conversion".
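
To make the points above concrete, here is a rough sketch (not proposed
code) of a conversion request being checked against flags that were
fixed when the guest_memfd instance was created. All names here
(GMEM_CONVERT_PRESERVE, struct gmem_instance, gmem_get_caps(),
gmem_convert_to_private()) are made up for illustration and are not
existing KVM or guest_memfd ABI.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag, for illustration only; not an existing uAPI. */
#define GMEM_CONVERT_PRESERVE	(1u << 0) /* keep contents on shared->private */

/* Stand-in for a guest_memfd instance; the real structure differs. */
struct gmem_instance {
	uint32_t allowed_convert_flags;	/* fixed at creation time */
};

/* Report the supported conversion flags so userspace can discover them. */
static uint32_t gmem_get_caps(const struct gmem_instance *g)
{
	return g->allowed_convert_flags;
}

/*
 * Validate a shared->private conversion request. The check looks only
 * at the instance's allowed flags, never at a VM type.
 */
static int gmem_convert_to_private(const struct gmem_instance *g,
				   uint32_t flags)
{
	if (flags & ~g->allowed_convert_flags)
		return -EOPNOTSUPP;

	/* The actual conversion is elided in this sketch. */
	return 0;
}
```

With this shape, an SNP/TDX-backed instance would be created with no
"preserve" support and would reject the flag, while a pKVM-backed
instance could allow it, without the conversion path knowing which is
which.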

>
> > > Common operations in guest_memfd will need to either check for the
> > > userspace passed flag or the VM type, so no major change in
> > > guest_memfd implementation for either mechanism.
> >
> > While we discuss ABI, we should allow ourselves to think ahead. So, is a gmemfd
> > fd tied to a VM?
>
> Yes.
>
> > I think there is interest in de-coupling it?
>
> No? Even if we get to a point where multiple distinct VMs can bind to a single
> guest_memfd, e.g. for inter-VM shared memory, there will still need to be a sole
> owner of the memory. AFAICT, fully decoupling guest_memfd from a VM would add
> non-trivial complexity for zero practical benefit.
>
> > Is the VM type sticky?
> >
> > It seems the more they are separate, the better it will be to not have VM-aware
> > behavior living in gmem.
>
> Ya. A guest_memfd instance may have capabilities/features that are restricted
> and/or defined based on the properties of the owning VM, but we should do our
> best to make guest_memfd itself blissfully unaware of the VM type.
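
One way to read the above, as a minimal sketch under made-up names:
the owning VM exposes properties, a helper turns them into the set of
flags a guest_memfd instance may be created with, and the creation path
rejects anything outside that set. struct vm_props, struct gmem,
GMEM_F_PRESERVE_ON_CONVERT, vm_allowed_gmem_flags() and gmem_create()
are all hypothetical and only illustrate the layering.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical creation flag, for illustration only. */
#define GMEM_F_PRESERVE_ON_CONVERT	(1u << 0)

/* Properties of the owning VM; no VM type enum is exposed to gmem. */
struct vm_props {
	bool can_preserve_on_convert;
};

/* Stand-in for a guest_memfd instance. */
struct gmem {
	uint32_t flags;
};

/* Translate VM properties into the flags a new instance may use. */
static uint32_t vm_allowed_gmem_flags(const struct vm_props *vm)
{
	uint32_t allowed = 0;

	if (vm->can_preserve_on_convert)
		allowed |= GMEM_F_PRESERVE_ON_CONVERT;

	return allowed;
}

/* Create an instance, rejecting flags the owning VM cannot support. */
static int gmem_create(const struct vm_props *vm, uint32_t requested,
		       struct gmem *out)
{
	if (requested & ~vm_allowed_gmem_flags(vm))
		return -EINVAL;

	out->flags = requested;
	return 0;
}
```

After creation, the instance carries only its flags, so later
operations have no reason to look back at the VM type.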