[PATCH v1 00/13] KVM: Introduce KVM Userfault
From: James Houghton
Date: Wed Dec 04 2024 - 16:24:41 EST
This is a continuation of the original KVM Userfault RFC[1] from July.
It contains the simplifications we talked about at LPC[2].
Please see the RFC[1] for the problem description. In summary,
guest_memfd VMs have no mechanism for doing post-copy live migration.
KVM Userfault provides such a mechanism. Today there is no upstream
mechanism for installing memory into a guest_memfd, but there will
be one soon (e.g. [3]).
There is a second problem that KVM Userfault solves: userfaultfd-based
post-copy doesn't scale very well. KVM Userfault when used with
userfaultfd can scale much better in the common case that most post-copy
demand fetches are a result of vCPU access violations. This is a
continuation of the solution Anish was working on[4]. This aspect of
KVM Userfault is important for userfaultfd-based live migration when
scaling up to hundreds of vCPUs with ~30us network latency for a
PAGE_SIZE demand-fetch.
The implementation in this series is simpler than the RFC[1]. It adds...
1. a new memslot flag, KVM_MEM_USERFAULT,
2. a new parameter, userfault_bitmap, in struct kvm_memory_slot,
3. a new flag for KVM_RUN memory-fault exits, KVM_MEMORY_EXIT_FLAG_USERFAULT,
4. a new KVM capability, KVM_CAP_USERFAULT.
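As a rough illustration of how a VMM might use these pieces during post-copy, here is a minimal userspace-only sketch of the bitmap bookkeeping. It is not code from this series: struct demo_slot, the demo_* helpers, and DEMO_PAGE_SIZE are hypothetical names, and the sketch assumes one bit per PAGE_SIZE page, where a set bit means the page has not been fetched yet. Handing the bitmap to KVM and handling the KVM_RUN exit are omitted. The __atomic builtins are GCC/Clang extensions.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PAGE_SIZE     4096ULL /* assumption: 4K pages */
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical VMM-side bookkeeping for one memslot (not a KVM structure). */
struct demo_slot {
	uint64_t base_gpa;     /* guest physical address of the slot start */
	uint64_t npages;       /* slot size in pages */
	unsigned long *bitmap; /* one bit per page; set = not yet fetched */
};

/* Allocate the bitmap with every page marked as userfault. */
int demo_slot_init(struct demo_slot *slot, uint64_t base_gpa, uint64_t npages)
{
	size_t nlongs = (npages + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG;

	assert(slot != NULL && npages > 0);
	slot->bitmap = malloc(nlongs * sizeof(unsigned long));
	if (!slot->bitmap)
		return -1;
	memset(slot->bitmap, 0xff, nlongs * sizeof(unsigned long));
	slot->base_gpa = base_gpa;
	slot->npages = npages;
	return 0;
}

/* Return nonzero if page idx is still marked userfault. */
int demo_page_is_userfault(const struct demo_slot *slot, uint64_t idx)
{
	unsigned long word;

	assert(slot != NULL && idx < slot->npages);
	word = __atomic_load_n(&slot->bitmap[idx / DEMO_BITS_PER_LONG],
			       __ATOMIC_ACQUIRE);
	return (word >> (idx % DEMO_BITS_PER_LONG)) & 1UL;
}

/*
 * Call after the contents of [gpa, gpa + size) have been fetched and
 * installed: clear the matching bits so a faulting vCPU can make progress.
 * The clear is atomic because several vCPU threads may fault concurrently.
 */
void demo_mark_fetched(struct demo_slot *slot, uint64_t gpa, uint64_t size)
{
	uint64_t first, last, i;

	assert(slot != NULL && gpa >= slot->base_gpa && size > 0);
	first = (gpa - slot->base_gpa) / DEMO_PAGE_SIZE;
	last = (gpa - slot->base_gpa + size - 1) / DEMO_PAGE_SIZE;
	assert(last < slot->npages);

	for (i = first; i <= last; i++)
		__atomic_fetch_and(&slot->bitmap[i / DEMO_BITS_PER_LONG],
				   ~(1UL << (i % DEMO_BITS_PER_LONG)),
				   __ATOMIC_RELEASE);
}
```

In this model, a VMM would call demo_mark_fetched() with the faulting address and size once the page contents are in place, then re-enter the vCPU.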
KVM Userfault does not attempt to catch KVM's own accesses to guest
memory. That is left up to userfaultfd.
When enabling KVM_MEM_USERFAULT for a memslot, the second-stage mappings
are zapped, and new faults will check `userfault_bitmap` to see if the
fault should exit to userspace.
When KVM_MEM_USERFAULT is enabled, only PAGE_SIZE mappings are
permitted.
When disabling KVM_MEM_USERFAULT, huge mappings will be reconstructed
(either eagerly or on-demand; the architecture can decide).
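The fault-time behavior described above can be modeled roughly as follows. This is a simplified standalone sketch, not the actual x86 or arm64 fault path from the patches: struct demo_memslot, enum demo_fault_action, and demo_handle_fault() are hypothetical names, and it assumes a set bit in the bitmap means "exit to userspace".

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for a memslot; not the kernel's struct kvm_memory_slot. */
struct demo_memslot {
	uint64_t base_gfn;
	uint64_t npages;
	bool userfault;                        /* models KVM_MEM_USERFAULT */
	const unsigned long *userfault_bitmap; /* one bit per PAGE_SIZE page */
};

enum demo_fault_action {
	DEMO_EXIT_TO_USERSPACE, /* flag on, bit set: let the VMM fetch the page */
	DEMO_MAP_PAGE_SIZE,     /* flag on, bit clear: map, but only at PAGE_SIZE */
	DEMO_MAP_ANY_SIZE,      /* flag off: huge mappings are allowed again */
};

/* Decide what a vCPU fault on gfn should do in this model. */
enum demo_fault_action demo_handle_fault(const struct demo_memslot *slot,
					 uint64_t gfn)
{
	uint64_t idx;
	unsigned long word;

	assert(slot != NULL);
	assert(gfn >= slot->base_gfn && gfn - slot->base_gfn < slot->npages);

	if (!slot->userfault)
		return DEMO_MAP_ANY_SIZE;

	assert(slot->userfault_bitmap != NULL);
	idx = gfn - slot->base_gfn;
	word = slot->userfault_bitmap[idx / DEMO_BITS_PER_LONG];
	if ((word >> (idx % DEMO_BITS_PER_LONG)) & 1UL)
		return DEMO_EXIT_TO_USERSPACE;

	return DEMO_MAP_PAGE_SIZE;
}
```

The model leaves out the zap on enable and the reconstruction of huge mappings on disable; it only captures the per-fault decision.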
KVM Userfault is not compatible with async page faults. Nikita has
proposed a new, more userspace-driven implementation of async page
faults that *is* compatible with KVM Userfault[5].
Performance
===========
The takeaways I have are:
1. For cases where lock contention is not a concern, there is a
discernible win because KVM Userfault saves the trip through the
userfaultfd poll/read/WAKE cycle.
2. Using a single userfaultfd without KVM Userfault gets very slow as
the number of vCPUs increases, and it gets even slower when you add
more reader threads. This is due to contention on the userfaultfd
wait_queue locks. This is the contention that KVM Userfault avoids.
Compare this to the multiple-userfaultfd runs; they are much faster
because the wait_queue locks are sharded perfectly (1 per vCPU).
Perfect sharding is only possible because the vCPUs are configured to
touch only their own chunk of memory.
Config:
- 64M per vcpu
- vcpus only touch their 64M (`-b 64M -a`)
- THPs disabled
- MGLRU disabled
Each run used the following command:
./demand_paging_test -b 64M -a -v <#vcpus> \
-s shmem \ # if using shmem
-r <#readers> -u <uffd_mode> \ # if using userfaultfd
-k \ # if using KVM Userfault
-m 3 # when on arm64
Note: I patched demand_paging_test so that, when using shmem, the page
cache is always preallocated, not only in the `-u MINOR` case.
Otherwise the comparison would be unfair. I left this patch out of
the selftest commits, but I am happy to add it if it would be
useful.
== x86 (96 LPUs, 48 cores, TDP MMU enabled) ==
-- Anonymous memory, single userfaultfd
userfault mode / vcpus         2       8      64
no userfault              306845  220402   47720
MISSING (single reader)    90724   26059    3029
MISSING                    86840   37912    1664
MISSING + KVM UF          143021  104385   34910
KVM UF                    160326  128247   39913
-- shmem (preallocated), single userfaultfd
userfault mode / vcpus         2       8      64
no userfault              375130  214635   54420
MINOR (single reader)     102336   31704    3244
MINOR                      97981   36982    1673
MINOR + KVM UF            161835  113716   33577
KVM UF                    181972  131204   42914
-- shmem (preallocated), multiple userfaultfds
userfault mode / vcpus         2       8      64
no userfault              374060  216108   54433
MINOR                     102661   56978   11300
MINOR + KVM UF            167080  123461   48382
KVM UF                    180439  122310   42539
== arm64 (96 PEs, AmpereOne) ==
-- shmem (preallocated), single userfaultfd
userfault mode / vcpus         2       8      64
no userfault              419069  363081   34348
MINOR (single reader)      87421   36147    3764
MINOR                      84953   43444    1323
MINOR + KVM UF            164509  139986   12373
KVM UF                    185706  122153   12153
-- shmem (preallocated), multiple userfaultfds
userfault mode / vcpus         2       8      64
no userfault              401931  334142   36117
MINOR                      83696   75617   15996
MINOR + KVM UF            176327  115784   12198
KVM UF                    190074  126966   12084
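As a worked check of takeaway 2, the ratio between the "KVM UF" and "MINOR" rows of the x86 shmem single-userfaultfd table can be computed directly. The helper names below are hypothetical, and the sketch assumes higher figures are better (as the "no userfault" rows suggest). The ratio grows from roughly 2x at 2 vCPUs to roughly 26x at 64 vCPUs.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of two figures taken from the tables above. */
double demo_speedup(double with_kvm_uf, double without_kvm_uf)
{
	assert(without_kvm_uf > 0.0);
	return with_kvm_uf / without_kvm_uf;
}

/* x86, shmem (preallocated), single userfaultfd: "KVM UF" vs "MINOR". */
void demo_print_speedups(void)
{
	printf("2 vcpus:  %.1fx\n", demo_speedup(181972, 97981));
	printf("8 vcpus:  %.1fx\n", demo_speedup(131204, 36982));
	printf("64 vcpus: %.1fx\n", demo_speedup(42914, 1673));
}
```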
This series is based on the latest kvm/next.
[1]: https://lore.kernel.org/kvm/20240710234222.2333120-1-jthoughton@xxxxxxxxxx/
[2]: https://lpc.events/event/18/contributions/1757/
[3]: https://lore.kernel.org/kvm/20241112073837.22284-1-yan.y.zhao@xxxxxxxxx/
[4]: https://lore.kernel.org/all/20240215235405.368539-1-amoorthy@xxxxxxxxxx/
[5]: https://lore.kernel.org/kvm/20241118123948.4796-1-kalyazin@xxxxxxxxxx/#t
James Houghton (13):
KVM: Add KVM_MEM_USERFAULT memslot flag and bitmap
KVM: Add KVM_MEMORY_EXIT_FLAG_USERFAULT
KVM: Allow late setting of KVM_MEM_USERFAULT on guest_memfd memslot
KVM: Advertise KVM_CAP_USERFAULT in KVM_CHECK_EXTENSION
KVM: x86/mmu: Add support for KVM_MEM_USERFAULT
KVM: arm64: Add support for KVM_MEM_USERFAULT
KVM: selftests: Fix vm_mem_region_set_flags docstring
KVM: selftests: Fix prefault_mem logic
KVM: selftests: Add va_start/end into uffd_desc
KVM: selftests: Add KVM Userfault mode to demand_paging_test
KVM: selftests: Inform set_memory_region_test of KVM_MEM_USERFAULT
KVM: selftests: Add KVM_MEM_USERFAULT + guest_memfd toggle tests
KVM: Documentation: Add KVM_CAP_USERFAULT and KVM_MEM_USERFAULT details
Documentation/virt/kvm/api.rst | 33 +++-
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/mmu.c | 23 ++-
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/mmu/mmu.c | 27 +++-
arch/x86/kvm/mmu/mmu_internal.h | 20 ++-
arch/x86/kvm/x86.c | 36 +++--
include/linux/kvm_host.h | 19 ++-
include/uapi/linux/kvm.h | 6 +-
.../selftests/kvm/demand_paging_test.c | 145 ++++++++++++++++--
.../testing/selftests/kvm/include/kvm_util.h | 5 +
.../selftests/kvm/include/userfaultfd_util.h | 2 +
tools/testing/selftests/kvm/lib/kvm_util.c | 42 ++++-
.../selftests/kvm/lib/userfaultfd_util.c | 2 +
.../selftests/kvm/set_memory_region_test.c | 33 ++++
virt/kvm/Kconfig | 3 +
virt/kvm/kvm_main.c | 47 +++++-
17 files changed, 409 insertions(+), 36 deletions(-)
base-commit: 4d911c7abee56771b0219a9fbf0120d06bdc9c14
--
2.47.0.338.g60cca15819-goog