[RFC PATCH 00/18] KVM: Post-copy live migration for guest_memfd

From: James Houghton
Date: Wed Jul 10 2024 - 19:43:00 EST


This patch series implements the KVM-based demand paging system that was
first introduced back in November[1] by David Matlack.

The working name for this new system is KVM Userfault, but that name is
very confusing, so it will not be the final name.

Problem: post-copy with guest_memfd
===================================

Post-copy live migration makes it possible to migrate VMs from one host
to another no matter how fast they are writing to memory, while keeping
the VM paused for a minimal amount of time. For post-copy to work, we
need:
1. a way to prevent KVM from accessing particular pages of guest memory
   until we have populated them.
2. a way for userspace to know when KVM is trying to access a particular
   page.
3. a way to allow the access to proceed.

Traditionally, post-copy live migration is implemented using
userfaultfd, which hooks into the main mm fault path. KVM hits this path
when it is doing HVA -> PFN translations (with GUP) or when it itself
attempts to access guest memory. Userfaultfd sends a page fault
notification to userspace, and KVM goes to sleep.

Userfaultfd works well, as it is not specific to KVM; everyone who
attempts to access guest memory will block the same way.

However, with guest_memfd, we do not use GUP to translate from GFN to
HPA (nor is there an intermediate HVA).

So userfaultfd in its current form cannot be used to support post-copy
live migration with guest_memfd-backed VMs.

Solution: hook into the gfn -> pfn translation
==============================================

The only way to implement post-copy with a non-KVM-specific
userfaultfd-like system would be to introduce the concept of a
file-userfault[2] to intercept faults on a guest_memfd.

Instead, we take the simpler approach of adding a KVM-specific API, and
we hook into the GFN -> HVA or GFN -> PFN translation steps (for
traditional memslots and for guest_memfd respectively).

I have intentionally added support for traditional memslots: the
complexity that it adds is minimal, and it is useful for some VMMs
because it can be used to fully implement post-copy live migration.

Implementation Details
======================

Let's break down how KVM implements each of the three core requirements
for implementing post-copy as laid out above:

--- Preventing access: KVM_MEMORY_ATTRIBUTE_USERFAULT ---

The most straightforward way to inform KVM of userfault-enabled pages is
to use a new memory attribute, say KVM_MEMORY_ATTRIBUTE_USERFAULT.

There is already infrastructure in place for modifying and checking
memory attributes. Using this interface is slightly challenging, as there
is no UAPI for setting/clearing particular attributes; we must set the
exact attributes we want.
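
To illustrate the "exact attributes" point: because the ioctl takes the
complete attribute value for a range, userspace has to remember what is
already set and recompute the whole value whenever it toggles one bit. A
minimal sketch, assuming a placeholder bit position for
KVM_MEMORY_ATTRIBUTE_USERFAULT (the real value is whatever the UAPI
header in this series defines) and a local mirror of
struct kvm_memory_attributes; the helper name is hypothetical:

```c
#include <stdint.h>

/* Existing UAPI bit (KVM_MEMORY_ATTRIBUTE_PRIVATE). */
#define ATTR_PRIVATE   (1ULL << 3)
/* ASSUMPTION: placeholder bit for KVM_MEMORY_ATTRIBUTE_USERFAULT. */
#define ATTR_USERFAULT (1ULL << 4)

/* Local mirror of struct kvm_memory_attributes, for illustration only. */
struct mem_attrs_arg {
	uint64_t address;	/* guest physical address of the range */
	uint64_t size;		/* length of the range in bytes */
	uint64_t attributes;	/* the complete new attribute value */
	uint64_t flags;		/* must be zero */
};

/*
 * Build the argument that would be passed to KVM_SET_MEMORY_ATTRIBUTES.
 * 'current' is userspace's own record of the range's attributes: the
 * ioctl replaces the whole value, so unrelated bits (e.g. PRIVATE) must
 * be carried over by hand.
 */
static struct mem_attrs_arg
build_userfault_update(uint64_t gpa, uint64_t size, uint64_t current,
		       int enable)
{
	struct mem_attrs_arg arg = { .address = gpa, .size = size, .flags = 0 };

	arg.attributes = enable ? (current | ATTR_USERFAULT)
				: (current & ~ATTR_USERFAULT);
	return arg;
}
```

The bookkeeping of 'current' is exactly the part that a set/clear style
UAPI would make unnecessary.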

The synchronization that is in place for updating memory attributes is
also not suitable for post-copy live migration, which requires updating
memory attributes (from userfault to no-userfault) very frequently.

Another potential interface could be to use something akin to a dirty
bitmap, where a bitmap describes which pages within a memslot (or VM)
should trigger userfaults. This would make it straightforward to update
the userfault status of a page cheaply.
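
As a rough illustration of the bitmap idea (not what this series
implements), the per-page state could be one bit per gfn, updated with
atomic bit operations so that clearing a page needs no heavyweight
synchronization. The struct and function names below are hypothetical:

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical per-memslot state: bit N set => page N triggers a userfault. */
struct userfault_bitmap {
	uint64_t base_gfn;	/* first gfn covered by the bitmap */
	size_t npages;		/* number of pages (bits) covered */
	atomic_ulong *words;	/* caller-provided storage */
};

/* Locate the word and bit mask for a gfn. */
static atomic_ulong *uf_word(struct userfault_bitmap *b, uint64_t gfn,
			     unsigned long *mask)
{
	size_t idx = (size_t)(gfn - b->base_gfn);

	assert(gfn >= b->base_gfn && idx < b->npages);
	*mask = 1UL << (idx % BITS_PER_WORD);
	return &b->words[idx / BITS_PER_WORD];
}

/* Mark a page as userfault-enabled. */
static void uf_set(struct userfault_bitmap *b, uint64_t gfn)
{
	unsigned long mask;
	atomic_ulong *w = uf_word(b, gfn, &mask);

	atomic_fetch_or(w, mask);
}

/* Resolve a page: a single atomic AND on one word. */
static void uf_clear(struct userfault_bitmap *b, uint64_t gfn)
{
	unsigned long mask;
	atomic_ulong *w = uf_word(b, gfn, &mask);

	atomic_fetch_and(w, ~mask);
}

/* Would an access to this gfn trigger a userfault? */
static bool uf_test(struct userfault_bitmap *b, uint64_t gfn)
{
	unsigned long mask;
	atomic_ulong *w = uf_word(b, gfn, &mask);

	return (atomic_load(w) & mask) != 0;
}
```

How such a bitmap would be shared between userspace and KVM is left
open here.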

When KVM Userfault is enabled, we need to be careful not to map a
userfault page in response to a fault on a non-userfault page. In this
RFC, I've taken the simplest approach: force new PTEs to be PAGE_SIZE.
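
The hazard can be sketched as follows. This is illustrative userspace C,
not code from the series; it assumes 4K base pages and 2M huge pages,
and all names are hypothetical. The first helper checks whether a 2M
mapping built for a fault on 'gfn' would also map some userfault page;
the second is the blanket policy described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGES_PER_2M 512u	/* ASSUMPTION: 2 MiB huge page / 4 KiB base page */

/*
 * uf[i] != 0 means page i (a gfn offset within the slot) is
 * userfault-enabled. Returns true if the 2M-aligned block containing
 * 'gfn' includes at least one userfault page.
 */
static bool huge_mapping_covers_userfault(const uint8_t *uf, size_t npages,
					  uint64_t gfn)
{
	uint64_t start = gfn & ~(uint64_t)(PAGES_PER_2M - 1);

	for (uint64_t i = start; i < start + PAGES_PER_2M && i < npages; i++)
		if (uf[i])
			return true;
	return false;
}

/* The simple policy: while userfault is enabled, map one page at a time. */
static unsigned int rfc_max_mapping_pages(bool userfault_enabled)
{
	return userfault_enabled ? 1 : PAGES_PER_2M;
}
```

Forcing PAGE_SIZE avoids the per-block scan entirely, at the cost of
smaller mappings while migration is in progress.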

--- Page fault notifications ---

For page faults generated by vCPUs running in guest mode, if the page
the vCPU is trying to access is a userfault-enabled page, we use
KVM_EXIT_MEMORY_FAULT with a new flag: KVM_MEMORY_EXIT_FLAG_USERFAULT.
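
From the VMM's side, handling this exit might look roughly like the
sketch below. It uses a local view of the relevant struct kvm_run fields
(exit_reason and memory_fault.{flags,gpa,size}) so that it stands alone;
the flag's bit position is a placeholder, and the function and enum
names are hypothetical.

```c
#include <stdint.h>

/* Stand-ins for illustration; real code takes these from <linux/kvm.h>. */
#define EXIT_MEMORY_FAULT	39u
/* ASSUMPTION: placeholder bit for KVM_MEMORY_EXIT_FLAG_USERFAULT. */
#define EXIT_FLAG_USERFAULT	(1ULL << 4)

/* The fields of struct kvm_run that matter here. */
struct run_view {
	uint32_t exit_reason;
	struct {
		uint64_t flags;
		uint64_t gpa;
		uint64_t size;
	} memory_fault;
};

enum exit_action {
	ACTION_OTHER,			/* not a userfault; handle elsewhere */
	ACTION_RESOLVE_USERFAULT,	/* fetch [gpa, gpa + size) and resolve */
};

/*
 * Decide what to do after KVM_RUN returns. On a userfault exit, report
 * the faulting range through *gpa and *size.
 */
static enum exit_action classify_exit(const struct run_view *run,
				      uint64_t *gpa, uint64_t *size)
{
	if (run->exit_reason != EXIT_MEMORY_FAULT)
		return ACTION_OTHER;
	if (!(run->memory_fault.flags & EXIT_FLAG_USERFAULT))
		return ACTION_OTHER;

	*gpa = run->memory_fault.gpa;
	*size = run->memory_fault.size;
	return ACTION_RESOLVE_USERFAULT;
}
```

On ACTION_RESOLVE_USERFAULT the VMM would fetch the page contents from
the source, install them, mark the range as no longer userfault-enabled,
and call KVM_RUN again.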

For arm64, I believe this is actually all we need, provided we handle
steal_time properly.

On x86, returning to userspace from deep within the instruction emulator
(or other non-trivial execution paths) is infeasible, so KVM must be
able to pause execution while userspace fetches the page, just as
userfaultfd would. Let's call these "asynchronous userfaults."

A new ioctl, KVM_READ_USERFAULT, has been added to read asynchronous
userfaults, and an eventfd is used to signal that new faults are
available for reading.
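
The eventfd half of this can be sketched with standard calls only. How
the eventfd is registered with KVM and what argument KVM_READ_USERFAULT
takes are defined by the patches and are deliberately not reproduced
here; the function name is hypothetical.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Wait until the eventfd is signalled, then consume the signal.
 * Returns the eventfd counter (> 0), 0 on timeout, or -1 on error.
 */
static int64_t wait_for_userfaults(int efd, int timeout_ms)
{
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	uint64_t count;
	int ret;

	ret = poll(&pfd, 1, timeout_ms);
	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;

	/* Reading an eventfd returns its counter and resets it to zero. */
	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;

	/*
	 * A VMM would now issue KVM_READ_USERFAULT on the VM fd until no
	 * more faults are pending, and resolve each one.
	 */
	return (int64_t)count;
}
```

A dedicated thread running this in a loop plays the same role as a
userfaultfd reader thread does today.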

Today, we busy-wait for a gfn to have userfault disabled. This will
change in the future.

--- Fault resolution ---

Resolving userfaults today is as simple as removing the USERFAULT memory
attribute on the faulting gfn. This will change if we do not end up
using memory attributes for KVM Userfault. Having a range-based wake-up
like userfaultfd (see UFFDIO_WAKE) might also be helpful for
performance.

Problems with this series
=========================

- This cannot be named KVM Userfault! Perhaps "KVM missing pages"?
- Memory attribute modification doesn't scale well.
- We busy-wait for pages to not be userfault-enabled.
- gfn_to_hva and gfn_to_pfn caches are not invalidated.
- Page tables are not collapsed when KVM Userfault is disabled.
- There is no self-test for asynchronous userfaults.
- Asynchronous userfaults can be dropped if KVM_READ_USERFAULT fails.
- Supports only x86 and arm64.
- Probably many more!

Thanks!

[1]: https://lore.kernel.org/kvm/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@xxxxxxxxxxxxxx/
[2]: https://lore.kernel.org/kvm/CADrL8HVwBjLpWDM9i9Co1puFWmJshZOKVu727fMPJUAbD+XX5g@xxxxxxxxxxxxxx/

James Houghton (18):
  KVM: Add KVM_USERFAULT build option
  KVM: Add KVM_CAP_USERFAULT and KVM_MEMORY_ATTRIBUTE_USERFAULT
  KVM: Put struct kvm pointer in memslot
  KVM: Fail __gfn_to_hva_many for userfault gfns.
  KVM: Add KVM_PFN_ERR_USERFAULT
  KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
  KVM: Provide attributes to kvm_arch_pre_set_memory_attributes
  KVM: x86: Add KVM Userfault support
  KVM: x86: Add vCPU fault fast-path for Userfault
  KVM: arm64: Add KVM Userfault support
  KVM: arm64: Add vCPU memory fault fast-path for Userfault
  KVM: arm64: Add userfault support for steal-time
  KVM: Add atomic parameter to __gfn_to_hva_many
  KVM: Add asynchronous userfaults, KVM_READ_USERFAULT
  KVM: guest_memfd: Add KVM Userfault support
  KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
  KVM: selftests: Add KVM Userfault mode to demand_paging_test
  KVM: selftests: Remove restriction in vm_set_memory_attributes

 Documentation/virt/kvm/api.rst               |  23 ++
 arch/arm64/include/asm/kvm_host.h            |   2 +-
 arch/arm64/kvm/Kconfig                       |   1 +
 arch/arm64/kvm/arm.c                         |   8 +-
 arch/arm64/kvm/mmu.c                         |  45 +++-
 arch/arm64/kvm/pvtime.c                      |  11 +-
 arch/x86/kvm/Kconfig                         |   1 +
 arch/x86/kvm/mmu/mmu.c                       |  67 +++++-
 arch/x86/kvm/mmu/mmu_internal.h              |   3 +-
 include/linux/kvm_host.h                     |  41 +++-
 include/uapi/linux/kvm.h                     |  13 ++
 .../selftests/kvm/demand_paging_test.c       |  46 +++-
 .../testing/selftests/kvm/include/kvm_util.h |   7 -
 virt/kvm/Kconfig                             |   4 +
 virt/kvm/guest_memfd.c                       |  16 +-
 virt/kvm/kvm_main.c                          | 213 +++++++++++++++++-
 16 files changed, 457 insertions(+), 44 deletions(-)


base-commit: 02b0d3b9d4dd1ef76b3e8c63175f1ae9ff392313
--
2.45.2.993.g49e7a77208-goog