Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
From: David Hildenbrand
Date: Wed Sep 15 2021 - 09:51:33 EST
diff --git a/mm/memfd.c b/mm/memfd.c
index 081dd33e6a61..ae43454789f4 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -130,11 +130,24 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
 	return NULL;
 }
+int memfd_register_guest(struct inode *inode, void *owner,
+			 const struct guest_ops *guest_ops,
+			 const struct guest_mem_ops **guest_mem_ops)
+{
+	if (shmem_mapping(inode->i_mapping)) {
+		return shmem_register_guest(inode, owner,
+					    guest_ops, guest_mem_ops);
+	}
+
+	return -EINVAL;
+}
Are we sticking our design to the memfd interface (e.g., other memory
backing stores like tmpfs and hugetlbfs will all rely on this memfd
interface to interact with KVM), or is this just the initial
implementation for a PoC?
I don't think we are, it still feels like we are in the early prototype
phase (even way before a PoC). I'd be happy to see something "cleaner"
so to say -- it still feels kind of hacky to me, especially since there
seem to be many pieces of the big puzzle missing so far. Unfortunately, this
series hasn't caught the attention of many -MM people so far, maybe
because other people miss the big picture as well and are waiting for a
complete design proposal.
For example, what's unclear to me: we'll be allocating pages with
GFP_HIGHUSER_MOVABLE, making them land on MIGRATE_CMA or ZONE_MOVABLE;
then we silently turn them unmovable, which breaks these concepts. Who'd
migrate these pages away just like when doing long-term pinning, or how
is that supposed to work?
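
To make that concern a bit more concrete, here is a toy user-space model
(not kernel code). Every name in it is made up for illustration, and it
only assumes what is said above: pages placed in movable areas are
expected to stay migratable, and long-term pinning handles that by
migrating such pages away before they become unmovable.

```c
/* Toy user-space model, not kernel code: all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_zone { TOY_ZONE_NORMAL, TOY_ZONE_MOVABLE, TOY_ZONE_CMA };

struct toy_page {
	enum toy_zone zone;	/* where the allocator placed the page */
	bool unmovable;		/* set once the page can no longer be migrated */
};

/* Pages in these areas are expected to stay migratable. */
bool toy_zone_requires_movable(enum toy_zone zone)
{
	return zone == TOY_ZONE_MOVABLE || zone == TOY_ZONE_CMA;
}

/* Stand-in for migration: the contents end up on a page in a normal zone. */
void toy_migrate_to_normal(struct toy_page *page)
{
	assert(page != NULL);
	page->zone = TOY_ZONE_NORMAL;
}

/* Conceptually what long-term pinning does: migrate first, then pin. */
void toy_make_unmovable_safely(struct toy_page *page)
{
	assert(page != NULL);
	if (toy_zone_requires_movable(page->zone))
		toy_migrate_to_normal(page);
	page->unmovable = true;
}

/* The pattern questioned above: turn the page unmovable in place. */
void toy_make_unmovable_silently(struct toy_page *page)
{
	assert(page != NULL);
	page->unmovable = true;
}

/* True if the page breaks the guarantee of the area it lives in. */
bool toy_page_violates_zone(const struct toy_page *page)
{
	assert(page != NULL);
	return page->unmovable && toy_zone_requires_movable(page->zone);
}
```

In this model, a page that starts in TOY_ZONE_MOVABLE and goes through
toy_make_unmovable_silently() ends up violating its zone, while the same
page going through toy_make_unmovable_safely() does not. The open
question is who would play the role of the "safely" path here.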
Also unclear to me is how refcount and mapcount will be handled to
prevent swapping, who will actually do some kind of gfn-epfn etc.
mapping, how we'll forbid access to this memory, e.g., via /proc/kcore or
when dumping memory ... and how it would ever work with
migration/swapping/rmap. That's clearly future work, but it's been raised
that this would be the way to make it work, and I don't quite see how it
would all come together.
<note>
Last but not least, I raised to Intel via a different channel that I'd
appreciate updated hardware that avoids essentially crashing the
hypervisor when writing to encrypted memory from user space. It has the
smell of "broken hardware" to it that might just be fixed by a new
hardware generation to make it look more similar to other successful
implementations of secure/encrypted memory. That might make it much easier to
support an initial version of TDX -- instead of having to reinvent the
way we map guest memory just now to support hardware that might sort out
the root problem later.
That said, there might be benefits to mapping guest memory
differently, but my gut feeling is that it might take quite a long time
to get something reasonable working, to settle on a design, and to get
it accepted by all involved parties to merge it upstream.
Just my 2 cents, I might be all wrong as so often.
<\note>
--
Thanks,
David / dhildenb