Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory

From: David Hildenbrand
Date: Wed Sep 15 2021 - 09:51:33 EST


diff --git a/mm/memfd.c b/mm/memfd.c
index 081dd33e6a61..ae43454789f4 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -130,11 +130,24 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
 	return NULL;
 }
+int memfd_register_guest(struct inode *inode, void *owner,
+			 const struct guest_ops *guest_ops,
+			 const struct guest_mem_ops **guest_mem_ops)
+{
+	if (shmem_mapping(inode->i_mapping)) {
+		return shmem_register_guest(inode, owner,
+					    guest_ops, guest_mem_ops);
+	}
+
+	return -EINVAL;
+}

Are we sticking with the memfd interface for our design (e.g., other memory
backing stores like tmpfs and hugetlbfs will all rely on this memfd interface
to interact with KVM), or is this just the initial implementation for a PoC?

I don't think we are; it still feels like we are in the early prototype phase (even way before a PoC). I'd be happy to see something "cleaner", so to speak. It still feels kind of hacky to me, especially since many pieces of the big puzzle seem to be missing so far. Unfortunately, this series hasn't caught the attention of many -MM people so far, maybe because other people miss the big picture as well and are waiting for a complete design proposal.

For example, what's unclear to me: we'll be allocating pages with GFP_HIGHUSER_MOVABLE, making them land on MIGRATE_CMA or ZONE_MOVABLE; then we silently turn them unmovable, which breaks these concepts. Who'd migrate these pages away just like when doing long-term pinning, or how is that supposed to work?

Also unclear to me is how refcount and mapcount will be handled to prevent swapping, who will actually do some kind of gfn-epfn etc. mapping, how we'll forbid access to this memory, e.g., via /proc/kcore or when dumping memory ... and how it would ever work with migration/swapping/rmap (it's clearly future work, but it's been raised that this would be the way to make it work; I don't quite see how it would all come together).

<note>
Last but not least, I raised to Intel via a different channel that I'd appreciate updated hardware that avoids essentially crashing the hypervisor when writing to encrypted memory from user space. It has the smell of "broken hardware" to it that might just be fixed by a new hardware generation to make it look more similar to other successful implementations of secure/encrypted memory. That might make it much easier to support an initial version of TDX, instead of having to reinvent the way we map guest memory just now to support hardware that might sort out the root problem later.

That said, there might be benefits to mapping guest memory differently, but my gut feeling is that it might take quite a long time to get something reasonable working, to settle on a design, and to get it accepted by all involved parties to merge it upstream.

Just my 2 cents; I might be all wrong, as so often.
</note>

--
Thanks,

David / dhildenb