Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2

From: Michael Roth

Date: Tue Apr 07 2026 - 17:10:15 EST


On Fri, Apr 03, 2026 at 07:50:16AM -0700, Ackerley Tng wrote:
> Ackerley Tng <ackerleytng@xxxxxxxxxx> writes:
>
> >
> > [...snip...]
> >
> > guest_memfd's populate will first check that the memory is shared, then
> > also set the memory to private after the populate.
> >
> > [...snip...]
> >
> Looking at this again, the above basically means that the entire
> conversion process needs to be performed within populate.
>
> In addition to setting the attributes in guest_memfd as private, for
> consistency, populate will also have to do all the associated
> operations, especially unmapping from the host and checking refcounts,
> and the list of work in conversion will only grow in the future with
> direct map removal/restoration and huge page merging.
>
> The complexity of conversion also means possible errors (EAGAIN for
> elevated refcounts, ENOMEM when we're out of memory), and error
> information, like the offset where the elevated refcount was found,
> that needs to be returned to userspace.
>
> It doesn't look like there's room for this information to be plumbed out
> through the platform-specific ioctls, and even if we make space, it
> seems odd to have conversion-related error information returned through
> the platform-specific call.
>
>
> I agree with the goal of not having KVM touch private memory contents as
> tracked by guest_memfd, so I'd like to propose that we distinguish:
>
> 1. private as tracked by KVM (guest_memfd/vm_memory_attributes)
> 2. private as tracked by trusted entity

I think this is a good distinction to keep in mind, because if we adopt
the proposal from the call of having userspace set the destination memory
to shared prior to calling kvm_gmem_populate(), then the pages don't
really stay shared until gmem converts them to private: instead, they get
set to "private as tracked by trusted entity", while at the same time
still having 'shared' memory attributes as far as KVM is concerned.
Normally (for SNP, at least), the 'private (as tracked by KVM)' state is
an intermediate state on the way to 'private (as tracked by KVM + trusted
entity)'.
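The two trackers and the inconsistency described above can be sketched as
a toy model (this is illustrative Python, not kernel code; the names
`PageState` and `populate_with_shared_first` are made up for this sketch):

```python
# Toy model of the two "private" trackers discussed above.
from dataclasses import dataclass

@dataclass
class PageState:
    kvm_private: bool   # private as tracked by KVM/guest_memfd attributes
    te_private: bool    # private as tracked by the trusted entity (firmware)

def populate_with_shared_first(page: PageState) -> PageState:
    """Model the proposal from the call: userspace sets the destination
    shared, then populate runs and the trusted entity encrypts it."""
    page.kvm_private = False   # userspace converts to shared pre-populate
    page.te_private = True     # firmware measures/encrypts during populate
    return page

page = populate_with_shared_first(PageState(kvm_private=True, te_private=False))
# The inconsistency: the trusted entity now considers the page private
# while KVM still tracks it as shared.
assert page.te_private and not page.kvm_private
```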

So we introduce some inconsistencies on that side, in order to address
the inconsistency of kvm_gmem_populate() writing to 'private (as tracked
by KVM)' memory. But as you point out...

> + destination address: private (as tracked by guest_memfd)
> + source address: shared (as tracked by guest_memfd) or NULL
>
> KVM doesn't touch private memory contents, even in this case, because
> it's really a platform-specific ioctl that handles loading, and the
> platform does expect the destination is private for both TDX and SNP
> at the firmware boundary.

...yah, it's not really gmem that's writing to that memory; it's the
platform-specific hooks that 'prepare' the memory as part of population
and put it into a 'private (as tracked by trusted entity)' state, just as
it's the platform-specific hooks that 'prepare' the memory in the vCPU
page fault path at run-time and put it into a 'private (as tracked by
trusted entity)' state. You could even imagine a naive CoCo implementation
that encrypts memory in-place at initial fault time via kvm_gmem_prepare()
hooks... we likely wouldn't insist on some new flow just because this
results in gmem calling something that writes to 'private (as tracked by
KVM)' pages; we'd consider that to be more of a platform-specific
implementation detail that should be handled the same way as on other
architectures. That seems roughly analogous to what is being discussed
here WRT the kvm_gmem_populate() path, so I think it makes sense to
continue expecting the pages to be marked private in advance of
platform-specific preparation, whether that happens via the populate path
or the runtime/fault-time path.


By expecting 'private' (as tracked by KVM) as the initial state for
kvm_gmem_populate(), a lot of invariants about private memory (safe
refcounts, direct map removal expectations, etc.) remain consistent even
in the populate path, where any special handling for private memory can be
accounted for in the same way, rather than as "shared, but..." or
"private, but...".
>
> Since SNP (platform-specific) only allows in-place launch update, and
> KVM had to provide an interface that allows a different source address
> to support userspace from before in-place conversion, SNP has to
> continue supporting the to-be-deprecated path by doing the copying
> within platform-specific code.
>
> For consistency, guest_memfd can continue to check that it tracks the
> destination address as private, and sev_gmem_populate will then hide
> the copying code away just to support the legacy case.
>
>
> The flow before in-place conversion is
>
> 1. Load memory (shared or non-guest_memfd memory)
> 2. KVM_SEV_SNP_LAUNCH_UPDATE or KVM_TDX_INIT_MEM_REGION (destination:
> gfn for separate private memory, source: shared memory)
>
> The proposed flow for in-place conversion is
>
> 1. INIT_SHARED or convert to shared
> 2. Load memory while it is shared
> 3. Convert to private (PRESERVE, or new flag?)
> 4. KVM_SEV_SNP_LAUNCH_UPDATE or KVM_TDX_INIT_MEM_REGION (destination:
> gfn for converted private memory, source: NULL)
>
>
> TLDR:
>
> + Think of populate ioctls not as KVM touching memory, but platform
> handling population.
> + KVM code (kvm_gmem_populate) still doesn't touch memory contents
> + post_populate is platform-specific code that handles loading into
> private destination memory just to support legacy non-in-place
> conversion.
> + Don't complicate populate ioctls by doing conversion just to support
> legacy use-cases where platform-specific code has to do copying on
> the host.

That's a good point: these are only considerations in the context of
actually copying from src->dst, but with in-place conversion the
primary/more-performant approach will be for userspace to initialize the
memory directly. I.e. if we enforced that, then gmem could rightly
ascertain that it isn't even writing to private pages via these hooks,
and any manipulation of that memory is purely on the part of the trusted
entity handling initial encryption/etc.
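The two flows Ackerley quotes above can be walked through with a small
toy model (hedged: the dict-based page model and the `copied` flag are my
own encoding, not anything in KVM; only the step ordering comes from the
mail):

```python
# Toy walk-through of the legacy vs. in-place population flows,
# tracking whether a host-side copy was needed.

def legacy_flow(src_contents):
    """Separate shared source -> private destination (pre-in-place-conversion)."""
    dst = {"attr": "private", "data": None}
    # LAUNCH_UPDATE with a separate source: platform-specific code
    # copies on the host, then encrypts in place.
    dst["data"] = src_contents
    return dst, True               # True: a host-side copy happened

def in_place_flow(payload):
    """Proposed flow: load while shared, convert, then populate with src=NULL."""
    dst = {"attr": "shared", "data": None}
    dst["data"] = payload          # load memory while it is shared
    dst["attr"] = "private"        # convert to private, preserving contents
    # LAUNCH_UPDATE with src=NULL: encrypt in place, no host copy
    return dst, False

d1, copied1 = legacy_flow(b"OVMF")
d2, copied2 = in_place_flow(b"OVMF")
assert d1["data"] == d2["data"] == b"OVMF"
assert copied1 and not copied2     # only the legacy path copies on the host
```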

I understand that we decided to keep the option of allowing separate
src/dst even with in-place conversion, but it doesn't seem worthwhile if
that necessarily means we need to glue population+conversion together in
1 clumsy interface that needs to handle partial return/error responses to
userspace (or potentially get stuck forever in the conversion path).

So I agree with Ackerley's proposal (which I guess is the same as what's
in this series).

However, 1 other alternative would be to do what was suggested on the
call, but require userspace to subsequently handle the shared->private
conversion. I think that would be workable too.

One other benefit of Ackerley's/current approach, however, is that it
allows us to potentially keep hugepages intact in the populate path, since
prepping/encrypting everything while it's in a shared state means gmem
would split the hugepage, and all the firmware/RMP/etc. data structures
would only be able to handle individual 4K pages. I still suspect doing
things like encoding the initial 2MB OVMF image as a single hugepage might
yield enough benefit to explore this (at some point). So there's some
niceness in knowing that Ackerley's approach would allow for that
eventually without requiring a complete rethink on these same topics.

Thanks,

Mike

>
> >>>
> >>> [...snip...]
> >>>