Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory

From: Kirill A. Shutemov
Date: Thu Sep 02 2021 - 14:47:19 EST


Hi folks,

I tried to sketch what the memfd changes would look like.

I've added F_SEAL_GUEST. The new seal is only allowed if there are no
pre-existing pages in the fd (i_mapping->nrpages check) and there's
no existing mapping of the file (RB_EMPTY_ROOT(&i_mapping->i_mmap.rb_root) check).
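
A rough userspace model of that precondition, just to pin down the
semantics. None of this is kernel code: struct fake_mapping, seal_guest()
and the field names are hypothetical stand-ins, and -EBUSY is only an
assumed error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of struct address_space we check. */
struct fake_mapping {
	unsigned long nrpages;	/* models i_mapping->nrpages */
	bool has_mappings;	/* models !RB_EMPTY_ROOT(&i_mapping->i_mmap.rb_root) */
	bool sealed_guest;	/* models F_SEAL_GUEST being set */
};

/* Returns 0 and sets the seal, or -EBUSY (assumed) if the fd is in use. */
static int seal_guest(struct fake_mapping *m)
{
	assert(m != NULL);

	if (m->nrpages != 0)		/* pre-existing pages in the fd */
		return -EBUSY;
	if (m->has_mappings)		/* file is already mapped somewhere */
		return -EBUSY;

	m->sealed_guest = true;
	return 0;
}
```

In the real thing both checks and the seal update would have to happen
under whatever lock protects the seals, which the model ignores.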

After the seal is set, no read/write/mmap from userspace is allowed.

It's not clear, though, how to serialize the read check vs. seal setup: the
seal is protected by inode_lock(), which we don't hold in the read path
because it is expensive. I don't know yet how to get it right. For TDX, it's
okay to allow read as it cannot trigger #MCE. Maybe we can allow it?

Truncate and punch hole are tricky.

We want to allow them in order to save memory if a substantial range is
converted to shared. A partial truncate or punch hole effectively writes
zeros to the partially truncated page and may lead to #MCE. We can reject any
partial truncate/punch requests, but that doesn't help the situation with THPs.

If we truncate to the middle of a THP, we try to split it into small
pages and proceed as usual for small pages. But the split is allowed to fail.
If that happens, we zero part of the THP.
I guess we may reject the truncate if the split fails. It should work fine if
we only use it for saving memory.
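
The policy from the last two paragraphs can be modelled in plain C as
below. It is only a sketch under assumptions: guest_punch_check(), the
MODEL_* constants and the error codes are hypothetical, huge pages are
taken to be 2M and naturally aligned in the file, and split_ok stands in
for the result of the THP split attempt.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE		4096ULL
#define MODEL_HPAGE_SIZE	(512ULL * MODEL_PAGE_SIZE)	/* 2M */

/* True if the boundary falls strictly inside a unit of the given size. */
static bool cuts_through(uint64_t boundary, uint64_t unit)
{
	return (boundary % unit) != 0;
}

/*
 * Decide whether punching [start, end) on a sealed fd is acceptable.
 * backed_by_thp: the range is backed by huge pages.
 * split_ok:      the attempt to split the huge page succeeded.
 */
static int guest_punch_check(uint64_t start, uint64_t end,
			     bool backed_by_thp, bool split_ok)
{
	assert(start <= end);

	/* A partial small page would have to be zeroed: reject. */
	if (cuts_through(start, MODEL_PAGE_SIZE) ||
	    cuts_through(end, MODEL_PAGE_SIZE))
		return -EINVAL;

	/* A boundary in the middle of a THP needs a successful split. */
	if (backed_by_thp && !split_ok &&
	    (cuts_through(start, MODEL_HPAGE_SIZE) ||
	     cuts_through(end, MODEL_HPAGE_SIZE)))
		return -EBUSY;

	return 0;
}
```

With this, a 4k punch at offset 0 of a huge page is only accepted when the
split works, while punching a whole aligned 2M range never needs a split.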

We need to modify the truncation/punch path to notify KVM that pages are
about to be freed. I think we will register a callback in the memfd when
adding the fd to a KVM memslot, and that callback is going to be called for
the notification. That means 1:1 between memfd and memslot. I guess it's okay.
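
One possible shape for that registration, again as a userspace model
rather than kernel code. All names here (struct guest_ops, struct
fake_memfd, struct fake_memslot, guest_register(), guest_punch()) are
hypothetical, and -EBUSY for a second registration is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical notifier that the owner (a memslot) hands to the memfd. */
struct guest_ops {
	void (*invalidate)(void *owner, uint64_t start, uint64_t end);
};

struct fake_memfd {
	const struct guest_ops *ops;	/* NULL until a memslot registers */
	void *owner;
};

/* Stand-in for a memslot; it just records what it was told. */
struct fake_memslot {
	unsigned int calls;
	uint64_t last_start, last_end;
};

static void memslot_invalidate(void *owner, uint64_t start, uint64_t end)
{
	struct fake_memslot *slot = owner;

	slot->calls++;
	slot->last_start = start;
	slot->last_end = end;
}

static const struct guest_ops memslot_ops = {
	.invalidate = memslot_invalidate,
};

/* Called when the fd is added to a memslot; enforces the 1:1 rule. */
static int guest_register(struct fake_memfd *f,
			  const struct guest_ops *ops, void *owner)
{
	assert(f != NULL && ops != NULL);

	if (f->ops)
		return -EBUSY;	/* already bound to another memslot */
	f->ops = ops;
	f->owner = owner;
	return 0;
}

/* Truncate/punch path: notify the owner before the pages go away. */
static void guest_punch(struct fake_memfd *f, uint64_t start, uint64_t end)
{
	assert(f != NULL);

	if (f->ops)
		f->ops->invalidate(f->owner, start, end);
	/* ... pages in [start, end) would be freed here ... */
}
```

A single ops pointer per fd is what gives the 1:1 relation; allowing
several memslots per memfd would need a list of registrations instead.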

Migration is going to always fail on F_SEAL_GUEST for now. It can be modified
to use a callback in the future.

Swapout will also always fail on F_SEAL_GUEST. It seems trivial. Again, it
can be a callback in the future.

For GPA->PFN translation KVM could use vm_ops->fault(). Semantically it is
a good fit, but we don't have any VMAs around and ->mmap is forbidden for
F_SEAL_GUEST.
The other option is to call shmem_getpage() directly, but it looks like a
layering violation to me. And it's not available to modules :/

Any comments?

--
Kirill A. Shutemov