Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory

From: David Hildenbrand
Date: Tue Aug 31 2021 - 15:07:19 EST

On 28.08.21 00:18, Sean Christopherson wrote:
On Thu, Aug 26, 2021, David Hildenbrand wrote:
You'll end up with a VMA that corresponds to the whole file in a single
process only, and that cannot vanish, not even in parts.

How would userspace tell the kernel to free parts of memory that it doesn't want
assigned to the guest, e.g. to free memory that the guest has converted to

I'd guess one possibility could be fallocate(FALLOC_FL_PUNCH_HOLE).
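
To make that a bit more concrete, here is a minimal user-space sketch of hole punching. It uses a regular memfd as a stand-in, since no encrypted fd exists yet. The helper names (release_range, demo_punch_hole) and the "guest-mem" label are made up for illustration. It only shows the existing fallocate() semantics and says nothing about how KVM or an encrypted backend would have to react:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: drop the backing pages for [offset, offset + len)
 * while keeping the file size unchanged. FALLOC_FL_PUNCH_HOLE has to be
 * combined with FALLOC_FL_KEEP_SIZE.
 */
int release_range(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}

/* Returns 0 if hole punching behaved as expected, -1 otherwise. */
int demo_punch_hole(void)
{
    const long page = sysconf(_SC_PAGESIZE);
    const off_t size = 4 * page;
    unsigned char *buf = malloc(page);
    struct stat st;
    int ret = -1;
    /* A plain memfd stands in for the (not yet existing) encrypted fd. */
    int fd = memfd_create("guest-mem", MFD_CLOEXEC);

    if (fd < 0 || !buf)
        goto out;
    if (ftruncate(fd, size) < 0)
        goto out;

    /* Populate all four pages with a recognizable pattern. */
    memset(buf, 0xAA, page);
    for (int i = 0; i < 4; i++)
        if (pwrite(fd, buf, page, (off_t)i * page) != page)
            goto out;

    /* Free pages 1 and 2; pages 0 and 3 stay allocated. */
    if (release_range(fd, page, 2 * page) < 0)
        goto out;

    /* The file size must not have changed. */
    if (fstat(fd, &st) < 0 || st.st_size != size)
        goto out;

    /* The punched range reads back as zeroes ... */
    if (pread(fd, buf, page, page) != page)
        goto out;
    for (long i = 0; i < page; i++)
        if (buf[i] != 0)
            goto out;

    /* ... while the untouched page 0 still holds the old contents. */
    if (pread(fd, buf, page, 0) != page || buf[0] != 0xAA)
        goto out;

    ret = 0;
out:
    free(buf);
    if (fd >= 0)
        close(fd);
    return ret;
}
```

Calling demo_punch_hole() from main(), e.g. assert(demo_punch_hole() == 0), should succeed on a kernel with memfd support. Note that the old contents of the punched range are simply gone, which is exactly why the operation is destructive.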

Questions are: when would it actually be allowed to perform such a destructive operation? Do we have to protect against that? How would KVM protect against user space replacing private pages with shared pages in any of the models we discuss?

Define "ordinary" user memory slots as overlay on top of "encrypted" memory
slots. Inside KVM, bail out if you encounter such a VMA inside a normal
user memory slot. When creating an "encrypted" user memory slot, require that
the whole VMA is covered at creation time. You know the VMA can't change

This can work for the basic use cases, but even then I'd strongly prefer not to
tie memslot correctness to the VMAs. KVM doesn't truly care what lies behind
the virtual address of a memslot, and when it does care, it tends to do poorly,
e.g. see the whole PFNMAP snafu. KVM cares about the pfn<->gfn mappings, and
that's reflected in the infrastructure. E.g. KVM relies on the mmu_notifiers
to handle mprotect()/munmap()/etc...

Right, and for the existing use cases this worked. But encrypted memory breaks many assumptions we once made ...

I have somewhat mixed feelings about pages that are mapped into $WHATEVER page tables but not actually mapped into user space page tables. There is no way to reach these via the rmap.

We have something like that already via vfio. And that is fundamentally broken when it comes to mmu notifiers, page pinning, page migration, ...

As is, I don't think KVM would get any kind of notification if userspace unmaps
the VMA for a private memslot that does not have any entries in the host page
tables. I'm sure it's a solvable problem, e.g. by ensuring at least one page
is touched by the backing store, but I don't think the end result would be any
prettier than a dedicated API for KVM to consume.

Relying on VMAs, and thus the mmu_notifiers, also doesn't provide line of sight
to page migration or swap. For those types of operations, KVM currently just
reacts to invalidation notifications by zapping guest PTEs, and then gets the
new pfn when the guest re-faults on the page. That sequence doesn't work for
TDX or SEV-SNP because the trusted agent needs to do the memcpy() of the page
contents, i.e. the host needs to call into KVM for the actual migration.

Right, but I still think this is a kernel-internal detail. You can do such a handshake later in the kernel IMHO.

But I have also been wondering: is it really KVM that is to perform the migration, or is it the fd-provider that performs the migration? Who says memfd_encrypted() doesn't default to a TDX "backend" on Intel CPUs that just knows how to migrate such a page?

I'd love to have some details on how that's supposed to work, and which information we'd need to migrate/swap/... in addition to the EPFN and a new SPFN.

There's also the memory footprint side of things; the fd-based approach avoids
having to create host page tables for memory that by definition will never be
used by the host.

While that is true, that is not a compelling argument IMHO. There is no need to try to be better than the state of the art if just sticking with the state of the art results in something cleaner/better*. Likewise, we don't have special interfaces to map $WHATEVER into a guest while bypassing user space page tables.

* to be shown what actually is cleaner/better. We don't really have prototypes for either.


David / dhildenb