[PATCH RFC 09/12] Documentation: KVM: Document synchronization for managing guest faults

From: Ackerley Tng via B4 Relay

Date: Wed May 27 2026 - 11:39:03 EST


From: Ackerley Tng <ackerleytng@xxxxxxxxxx>

Document how synchronization is used while managing guest faults centrally
so code comments can point users at a central place.

Signed-off-by: Ackerley Tng <ackerleytng@xxxxxxxxxx>
---
Documentation/virt/kvm/locking.rst | 108 +++++++++++++++++++++++++++++++++++++
1 file changed, 108 insertions(+)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index f12664443e913..0663ccfe0633d 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -339,3 +339,111 @@ time it will be set using the Dirty tracking mechanism described above.
cpu_hotplug_lock is held, e.g. from cpufreq_boost_trigger_state(), and many
operations need to take cpu_hotplug_lock when loading a vendor module, e.g.
updating static calls.
+
+4. Synchronization while managing guest faults
+----------------------------------------------
+
+This section explains the intersection of these synchronization mechanisms:
+
+- ``kvm->srcu`` (for memslots)
+- ``kvm->mmu_invalidate_*`` (pending invalidations)
+- ``kvm->mn_*`` (synchronization for ``kvm->mmu_invalidate_*``)
+
+4.1 Overview
+^^^^^^^^^^^^
+
+KVM resolves guest page faults by translating the Guest Frame Number (GFN) into
+a Page Frame Number (PFN) via memslots and then populating its shadow page
+tables with the resulting mapping.
+
+While handling the guest page fault, KVM must ensure a consistent view of the
+active memslots container, so KVM takes ``srcu_read_lock(&kvm->srcu);``.
+
+Guest page fault handling can race with some request from host userspace to
+invalidate shadow page tables. These requests originate from a few places, such
+as
+
+1. MMU Notifiers: KVM registers callbacks with the kernel’s memory management
+ subsystem to know when there are changes to mappings in the host userspace
+ page tables.
+2. Memslot Updates: The host userspace VMM, such as QEMU may use the
+ ``KVM_SET_USER_MEMORY_REGION`` ioctl to add, delete, or move a memslot. KVM
+ must zap the affected shadow page tables to ensure the guest doesn't access
+ stale mappings.
+3. Memory Attribute Changes: The ``KVM_SET_MEMORY_ATTRIBUTES`` ioctl allows
+ userspace to change attributes for a range of guest memory (e.g., setting a
+ range as "private" for Confidential Computing). This also requires
+ invalidating existing shadow mappings.
+
+When such a race occurs, KVM optimistically allows the faulting logic to
+proceed, but just before committing the fault, KVM will check for a pending
+invalidation, and retry the fault process if there is a pending invalidation
+affecting the GFN where the fault occurred.
+
+4.2 Tracking pending invalidations with ``kvm->mmu_invalidate*`` fields
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A "pending invalidation" is determined using a combination of
+
+- ``kvm->mmu_invalidate_in_progress``
+- ``kvm->mmu_invalidate_range_start`` and ``kvm->mmu_invalidate_range_end``
+- ``kvm->mmu_invalidate_seq``
+
+``is_page_fault_stale()`` shows how the above fields are used to determine if
+the page fault is stale and requires a retry.
+
+To protect the above combination of fields, a lock is used, which is the
+``kvm->mmu_lock``.
+
+4.2.1 Derived information vs pending invalidations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generally, the result of any information derived from GFN aka page
+attribute/page metadata lookups may race with invalidations. Here are some
+examples of lookups:
+
+- ``host_pfn_mapping_level()`` uses memslot information to find the mapping
+ level of pages in host userspace page tables. If there's an invalidation, the
+ pages that were mapped would no longer be mapped and hence the mapping level
+ result would be stale.
+
+There are several ways to ensure valid results:
+
+- Check ``mmu_invalidate_retry_gfn()`` after grabbing the result, before
+ consuming it. In this case, ``mmu_lock`` doesn't need to be held during the
+ lookup, but it does need to be held while checking the MMU notifier. KVM's
+ guest page fault handling uses this option.
+- Hold ``mmu_lock`` AND ensure there is no in-progress MMU notifier invalidation
+ event for the hva. This can be done by explicit checking the MMU notifier or
+ by ensuring that KVM already has a valid mapping that covers the
+ hva. ``kvm_mmu_recover_huge_pages()`` uses this option.
+
+4.3 Further optimization: ignoring invalidations if there is no matching memslot
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Invalidation is only really required when the invalidated memory range overlaps
+with some memslot. Without a matching memslot, the invalidation request could
+actually just be ignored. Hence, KVM only updates the ``kvm->mmu_invalidate_*``
+fields and takes ``kvm->mmu_lock`` if it finds a matching memslot.
+
+This creates another problem: if memslots are updated while there is an ongoing
+invalidation, then the updates to the fields and the lock would be imbalanced.
+
+4.4 Synchronization for invalidation lock/fields: ``kvm->mn_*``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To make sure the updates to the invalidation lock/fields are balanced, KVM has a
+further layer of synchronization. ``kvm_swap_active_memslots()`` enforces that
+changes to memslots are only committed once all pending invalidations are
+complete.
+
+In other words, ``kvm->mn_*`` ensures the following does not happen:
+
+1. Some memslot existed, causing a pending invalidation request to be recorded
+ in the ``kvm->mmu_invalidate_*`` fields
+2. Memslot got removed, so the invalidation request was never removed from the
+ ``kvm->mmu_invalidate_*`` fields.
+
+In addition, ``kvm_swap_active_memslots()`` also enforces that changes to
+memslots are complete before doing ``synchronize_srcu(&kvm->srcu)`` to make sure
+running readers of the old memslots container are done before freeing it.

--
2.54.0.823.g6e5bcc1fc9-goog