[RFC PATCH 0/8] Unmapping guest_memfd from Direct Map

From: Patrick Roy
Date: Tue Jul 09 2024 - 09:21:59 EST


Hey all,

This RFC series is a rough draft adding support for running
non-confidential compute VMs in guest_memfd, based on prior discussions
with Sean [1]. Our specific use case for this is the ability to unmap
guest memory from the host kernel's direct map, as a mitigation against
a large class of speculative execution issues.

=== Implementation ===

This patch series introduces a new flag to the `KVM_CREATE_GUEST_MEMFD`
ioctl that removes the guest_memfd's pages from the direct map when they
are allocated. When trying to run a guest from such a guest_memfd, we
now face the problem that without either userspace or kernelspace
mappings of guest_memfd, KVM cannot access guest memory to, for example,
do MMIO emulation or access memory used for guest/host communication. We
have multiple options for
solving this when running non-CoCo VMs: (1) implement a TDX-light
solution, where the guest shares memory that KVM needs to access, and
relies on paravirtual solutions where this is not possible (e.g. MMIO),
(2) have KVM use userspace mappings of guest_memfd (e.g. a
memfd_secret-style solution), or (3) dynamically reinsert pages into the
direct map whenever KVM wants to access them.

This RFC goes for option (3). Option (1) is a lot of overhead for very
little gain, since we are not actually constrained by a physical
inability to access guest memory (e.g. we are not in a TDX context where
accesses to guest memory cause a #MC). Option (2) has previously been
rejected [1].
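
To make option (3) a bit more concrete, here is a rough userspace
analogy, not the kernel code from this series: a page is kept
inaccessible by default and access is restored only for the duration of
a single copy. It uses mprotect() where the series uses
set_direct_map_{invalid,default}_noflush; struct demo_page and
demo_page_alloc/demo_page_access are made-up names for illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for one guest page that is normally inaccessible. */
struct demo_page {
	void *addr;
	size_t len;
};

/* Map one anonymous page with PROT_NONE, i.e. "removed" from our view. */
static int demo_page_alloc(struct demo_page *p)
{
	long sz = sysconf(_SC_PAGESIZE);

	if (sz <= 0)
		return -1;
	p->len = (size_t)sz;
	p->addr = mmap(NULL, p->len, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p->addr == MAP_FAILED ? -1 : 0;
}

/* Restore access, copy to or from the page, then revoke access again. */
static int demo_page_access(struct demo_page *p, size_t off, void *buf,
			    size_t n, int is_write)
{
	assert(p->addr != MAP_FAILED);
	if (off > p->len || n > p->len - off)
		return -1;
	if (mprotect(p->addr, p->len, PROT_READ | PROT_WRITE))
		return -1;
	if (is_write)
		memcpy((char *)p->addr + off, buf, n);
	else
		memcpy(buf, (char *)p->addr + off, n);
	return mprotect(p->addr, p->len, PROT_NONE);
}
```

In this model the page is only reachable between the two mprotect()
calls; any access outside of demo_page_access() would fault.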

In this patch series, we make sufficient parts of KVM gmem-aware to be
able to boot a Linux initrd from private memory on x86. These include
KVM's MMIO emulation (including guest page table walking) and kvm-clock.
For VM types which do not allow accessing gmem, we return -EFAULT and
attempt to prepare a KVM_EXIT_MEMORY_FAULT.
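
The -EFAULT behaviour can be pictured with a small userspace model. This
is only a sketch under assumptions: struct model_vm, the MODEL_* names
and model_read_private() are invented for illustration and do not
correspond to the KVM code in this series. The -EINVAL return for
out-of-range accesses is likewise just a choice made for the model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_GMEM_SIZE 4096u

/* Hypothetical exit reasons; the model's analogue of KVM_EXIT_MEMORY_FAULT. */
enum model_exit_reason { MODEL_EXIT_NONE = 0, MODEL_EXIT_MEMORY_FAULT };

/* Hypothetical VM: a flag for the VM type plus one page of private memory. */
struct model_vm {
	bool allows_gmem_access;
	uint8_t gmem[MODEL_GMEM_SIZE];
	enum model_exit_reason exit_reason;
	uint64_t fault_gpa;
	uint64_t fault_size;
};

/* Read guest-private memory, or report a memory fault if not permitted. */
static int model_read_private(struct model_vm *vm, uint64_t gpa,
			      void *dst, size_t len)
{
	assert(vm != NULL && dst != NULL);
	if (gpa > MODEL_GMEM_SIZE || len > MODEL_GMEM_SIZE - gpa)
		return -EINVAL; /* model-only bounds check */

	if (!vm->allows_gmem_access) {
		/* Record what userspace would need to handle the fault. */
		vm->exit_reason = MODEL_EXIT_MEMORY_FAULT;
		vm->fault_gpa = gpa;
		vm->fault_size = len;
		return -EFAULT;
	}

	memcpy(dst, &vm->gmem[gpa], len);
	return 0;
}
```

A VM with allows_gmem_access set gets its data copied out; any other VM
gets -EFAULT together with the faulting range recorded for userspace.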

Additionally, this patch series adds support for "restricted" userspace
mappings of guest_memfd, which work similarly to memfd_secret (e.g. they
disallow get_user_pages). This allows handling I/O and loading the
guest kernel in a simple way. Support for this is completely independent
of the rest of the functionality introduced in this patch series.
However, it is required to build a minimal hypervisor PoC that actually
allows booting a VM from a disk.

=== Performance ===

We have run some preliminary performance benchmarks to assess the impact
of on-the-fly direct map manipulations. We were mainly interested in the
impact of manipulating the direct map for MMIO emulation on virtio-mmio.
In particular, we were worried about the impact of the TLB and L1/2/3
cache flushes that set_memory_[n]p entails.

In our setup, we have taken a modified Firecracker VMM, spawned a Linux
guest with 1 vCPU, and used fio to stress a virtio_blk device. We found
that the cache flushes caused throughput to drop from around 600MB/s to
~50MB/s (~90%) for both reads and writes (on an Intel(R) Xeon(R) Platinum
8375C CPU with 64 cores). We then converted our prototype to use
set_direct_map_{invalid,default}_noflush instead of set_memory_[n]p and
found that without cache flushes the pure impact of the direct map
manipulation is indistinguishable from noise. This is why we use
set_direct_map_{invalid,default}_noflush instead of set_memory_[n]p in
this RFC.

Note that in this comparison, both the baseline and the
guest_memfd-supporting version of Firecracker were made to bounce I/O
buffers in VMM userspace. As GUP is disabled for the guest_memfd VMAs,
the virtio stack cannot directly pass guest buffers to read/write
syscalls.
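
For readers unfamiliar with the term, "bouncing" here means roughly the
following. This is a minimal sketch and not Firecracker's actual code:
bounce_read() and BOUNCE_SIZE are hypothetical names, and guest_dst
stands for a pointer into a restricted userspace mapping of guest_memfd.
The read syscall only ever sees the ordinary bounce buffer, and the data
is then copied into guest memory by the VMM itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BOUNCE_SIZE 4096

/*
 * Read up to len bytes from fd into guest_dst via an intermediate buffer.
 * Returns the number of bytes copied, or -1 on a read error.
 */
static ssize_t bounce_read(int fd, void *guest_dst, size_t len)
{
	char bounce[BOUNCE_SIZE];
	size_t done = 0;

	assert(guest_dst != NULL);
	while (done < len) {
		size_t chunk = len - done < sizeof bounce ?
			       len - done : sizeof bounce;
		ssize_t n = read(fd, bounce, chunk);

		if (n < 0) {
			if (errno == EINTR)
				continue; /* interrupted, retry */
			return -1;
		}
		if (n == 0)
			break; /* end of file */

		/* The only access to guest memory is this plain memcpy. */
		memcpy((char *)guest_dst + done, bounce, (size_t)n);
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

The write direction would be symmetric: memcpy out of guest memory into
the bounce buffer first, then pass the bounce buffer to write().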

=== Security ===

We want to unmap guest memory from the host kernel's direct map as a
security mitigation against transient execution attacks. Temporarily restoring
direct map entries whenever KVM requires access to guest memory leaves a
gap in this mitigation. We believe this to be acceptable for the above
cases, since pages used for paravirtual guest/host communication (e.g.
kvm-clock) and guest page tables do not contain sensitive data. MMIO
emulation will only end up reading pages containing privileged
instructions (e.g. guest kernel code).

=== Summary ===

Patches 1-4 are about hot-patching various points inside of KVM that
access guest memory to correctly handle the case where memory happens to
be guest-private. This means either handling the access as a memory
error, or simply accessing the memslot's guest_memfd instead of looking
at the userspace-provided VMA if the VM type allows these kinds of
accesses. Patches 5-6 add a flag to KVM_CREATE_GUEST_MEMFD that will
make it remove its pages from the kernel's direct map. Whenever KVM
wants to access guest-private memory, it will temporarily re-insert the
relevant pages. Patches 7-8 allow for restricted userspace mappings
(e.g. get_user_pages paths are disabled like for memfd_secret) of
guest_memfd, so that userspace has an easy path for loading the guest
kernel and handling I/O-buffers.

=== ToDos / Limitations ===

There are still a few rough edges that need to be addressed before
dropping the "RFC" tag, e.g.

* Handle errors of set_direct_map_default_noflush in
  kvm_gmem_invalidate_folio instead of calling BUG_ON
* Lift the limitation of "at most one gfn_to_pfn_cache for each
  gfn/pfn" in e1c61f0a7963 ("kvm: gmem: Temporarily restore direct map
  entries when needed"). It currently means that guests with more than 1
  vcpu fail to boot, because multiple vcpus can put their kvm-clock PV
  structures into the same page (gfn)
* Write selftests, particularly around hole punching, direct map removal,
  and mmap.

Lastly, there's the question of nested virtualization, which Sean brought
up in previous discussions and which runs into problems similar to those
of MMIO. I have looked at it very briefly. On Intel, KVM uses various
gfn->uhva caches, which run into problems similar to those of the
gfn_to_hva_caches dealt with in 200834b15dda ("kvm: use slowpath in
gfn_to_hva_cache if memory is private"). However, previous attempts at
just converting these to
gfn_to_pfn_cache (which would make them work with guest_memfd) proved
complicated [2]. I suppose initially, we should probably disallow nested
virtualization in VMs that have their memory removed from the direct
map.

Best,
Patrick

[1]: https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@xxxxxxxxxx/
[2]: https://lore.kernel.org/kvm/ZBEEQtmtNPaEqU1i@xxxxxxxxxx/

Patrick Roy (8):
  kvm: Allow reading/writing gmem using kvm_{read,write}_guest
  kvm: use slowpath in gfn_to_hva_cache if memory is private
  kvm: pfncache: enlighten about gmem
  kvm: x86: support walking guest page tables in gmem
  kvm: gmem: add option to remove guest private memory from direct map
  kvm: gmem: Temporarily restore direct map entries when needed
  mm: secretmem: use AS_INACCESSIBLE to prohibit GUP
  kvm: gmem: Allow restricted userspace mappings

 arch/x86/kvm/mmu/paging_tmpl.h |  94 +++++++++++++++++++-----
 include/linux/kvm_host.h       |   5 ++
 include/linux/kvm_types.h      |   1 +
 include/linux/secretmem.h      |  13 +++-
 include/uapi/linux/kvm.h       |   2 +
 mm/secretmem.c                 |   6 +-
 virt/kvm/guest_memfd.c         |  83 +++++++++++++++++++--
 virt/kvm/kvm_main.c            | 112 +++++++++++++++++++++++++++-
 virt/kvm/pfncache.c            | 130 +++++++++++++++++++++++++++++----
 9 files changed, 399 insertions(+), 47 deletions(-)


base-commit: 890a64810d59b1a58ed26efc28cfd821fc068e84
--
2.45.2