[RFC, PATCH 12/12] Documentation/userfaultfd: document working set tracking

From: Kiryl Shutsemau (Meta)

Date: Tue Apr 14 2026 - 10:33:26 EST


Document the new userfaultfd capabilities for VM working set tracking:

- UFFD_FEATURE_MINOR_ANON and UFFD_FEATURE_MINOR_ASYNC for anonymous
minor fault interception using the PROT_NONE hinting mechanism.
- UFFDIO_DEACTIVATE for marking pages as inaccessible while keeping
them resident.
- Sync and async fault resolution modes, and UFFDIO_SET_MODE for
runtime toggling between them.
- PAGEMAP_SCAN with PAGE_IS_UFFD_DEACTIVATED for cold page detection.
- Cleanup semantics on unregister and close.
- NUMA balancing interaction on anonymous VMAs.
- Complete VMM workflow example for the cold page eviction lifecycle,
with a note on shmem applicability.

Update the feature flag descriptions at the top of the guide to
reference the new section.

Signed-off-by: Kiryl Shutsemau (Meta) <kas@xxxxxxxxxx>
Assisted-by: Claude:claude-opus-4-6
---
Documentation/admin-guide/mm/userfaultfd.rst | 141 ++++++++++++++++++-
1 file changed, 140 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index e5cc8848dcb3..fc89e029060c 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -111,7 +111,11 @@ events, except page fault notifications, may be generated:
- ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports
``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory
areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating
- support for shmem virtual memory areas.
+ support for shmem virtual memory areas. ``UFFD_FEATURE_MINOR_ANON``
+ extends minor fault support to anonymous private memory using
+ PROT_NONE hinting; see the `Anonymous Minor Faults`_ section.
+ ``UFFD_FEATURE_MINOR_ASYNC`` enables asynchronous auto-resolution for
+ anonymous minor faults (requires ``UFFD_FEATURE_MINOR_ANON``).

- ``UFFD_FEATURE_MOVE`` indicates that the kernel supports moving an
existing page contents from userspace.
@@ -297,6 +301,141 @@ transparent to the guest, we want that same address range to act as if it was
still poisoned, even though it's on a new physical host which ostensibly
doesn't have a memory error in the exact same spot.

+Anonymous Minor Faults
+----------------------
+
+``UFFD_FEATURE_MINOR_ANON`` enables ``UFFDIO_REGISTER_MODE_MINOR`` on
+anonymous private memory. Unlike shmem/hugetlbfs minor faults (where a page
+exists in the page cache but has no PTE), anonymous minor faults use the
+PROT_NONE hinting mechanism: pages remain resident in memory with their PFNs
+preserved in the PTEs, but access permissions are removed so the next access
+triggers a fault.
+
+This is designed for VM memory managers that need to track the working set of
+anonymous guest memory for cold page eviction to tiered or remote storage.
+
+**Setup:**
+
+1. Open a userfaultfd and enable ``UFFD_FEATURE_MINOR_ANON`` (and optionally
+ ``UFFD_FEATURE_MINOR_ASYNC``) via ``UFFDIO_API``.
+
+2. Register the guest memory range with ``UFFDIO_REGISTER_MODE_MINOR``
+ (and ``UFFDIO_REGISTER_MODE_MISSING`` if evicted pages will need to be
+ fetched back from storage).
+
+**Deactivation:**
+
+Use ``UFFDIO_DEACTIVATE`` to mark pages as inaccessible. This ioctl takes a
+``struct uffdio_range`` and sets PROT_NONE on all present PTEs in the range,
+using the same mechanism as NUMA balancing. Pages stay resident and their
+physical frames are preserved — only access permissions are removed.
+
+**Fault Handling:**
+
+When a deactivated page is accessed:
+
+- **Sync mode** (default): The faulting thread blocks and a
+ ``UFFD_PAGEFAULT_FLAG_MINOR`` message is delivered to the userfaultfd
+ handler. The handler resolves the fault with ``UFFDIO_CONTINUE``, which
+ restores the PTE permissions and wakes the faulting thread.
+
+- **Async mode** (``UFFD_FEATURE_MINOR_ASYNC``): The kernel automatically
+ restores PTE permissions and the thread continues without blocking. No
+ message is delivered to the handler.
+
+**Cold Page Detection with PAGEMAP_SCAN:**
+
+After deactivating a range and letting the application run, use the
+``PAGEMAP_SCAN`` ioctl on ``/proc/pid/pagemap`` with the
+``PAGE_IS_UFFD_DEACTIVATED`` category flag to efficiently find pages that were
+never re-accessed (cold pages)::
+
+ struct pm_scan_arg arg = {
+ .size = sizeof(arg),
+ .start = guest_mem_start,
+ .end = guest_mem_end,
+ .vec = (uint64_t)regions,
+ .vec_len = regions_len,
+ .category_mask = PAGE_IS_UFFD_DEACTIVATED,
+ .return_mask = PAGE_IS_UFFD_DEACTIVATED,
+ };
+ long n = ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
+
+The returned ``page_region`` array contains contiguous cold ranges that can
+then be evicted.
+
+**Cleanup:**
+
+When the userfaultfd is closed or the range is unregistered, all protnone
+PTEs are automatically restored to their normal VMA permissions. This
+prevents pages from becoming permanently inaccessible.
+
+**Interaction with NUMA Balancing:**
+
+NUMA balancing is automatically disabled on anonymous VMAs registered with
+``UFFDIO_REGISTER_MODE_MINOR``, since both mechanisms use PROT_NONE PTEs
+as access hints and would interfere with each other. Shmem VMAs are not
+affected since ``UFFDIO_DEACTIVATE`` zaps PTEs there instead of using
+PROT_NONE.
+
+**VMM Working Set Tracking Workflow:**
+
+A typical VMM lifecycle for cold page eviction to tiered storage::
+
+ /* One-time setup */
+ uffd = userfaultfd(O_CLOEXEC | O_NONBLOCK);
+ ioctl(uffd, UFFDIO_API, &(struct uffdio_api){
+ .api = UFFD_API,
+ .features = UFFD_FEATURE_MINOR_ANON |
+ UFFD_FEATURE_MINOR_ASYNC,
+ });
+ ioctl(uffd, UFFDIO_REGISTER, &(struct uffdio_register){
+ .range = { guest_mem, guest_size },
+ .mode = UFFDIO_REGISTER_MODE_MINOR |
+ UFFDIO_REGISTER_MODE_MISSING,
+ });
+
+ /* Tracking loop */
+ while (vm_running) {
+ /* 1. Detection phase (async — no vCPU stalls) */
+ ioctl(uffd, UFFDIO_DEACTIVATE, &full_range);
+ sleep(tracking_interval);
+
+ /* 2. Find cold pages */
+ ioctl(pagemap_fd, PAGEMAP_SCAN, &(struct pm_scan_arg){
+ .category_mask = PAGE_IS_UFFD_DEACTIVATED,
+ ...
+ });
+
+ /* 3. Switch to sync for safe eviction */
+ ioctl(uffd, UFFDIO_SET_MODE,
+ &(struct uffdio_set_mode){
+ .disable = UFFD_FEATURE_MINOR_ASYNC });
+
+ /* 4. Evict cold pages (vCPU faults block in handler) */
+ for each cold range:
+ pwrite(storage_fd, cold_addr, len, offset);
+ madvise(cold_addr, len, MADV_DONTNEED);
+
+ /* 5. Resume async tracking */
+ ioctl(uffd, UFFDIO_SET_MODE,
+ &(struct uffdio_set_mode){
+ .enable = UFFD_FEATURE_MINOR_ASYNC });
+ }
+
+During step 4, if a vCPU accesses a cold page being evicted, it blocks
+with a ``UFFD_PAGEFAULT_FLAG_MINOR`` fault. The handler can either let it
+wait (the eviction completes, ``MADV_DONTNEED`` fires, the fault retries as
+``MISSING`` and is resolved with ``UFFDIO_COPY`` from storage) or resolve
+it immediately with ``UFFDIO_CONTINUE``.
+
+The same workflow applies to shmem-backed guest memory
+(``UFFD_FEATURE_MINOR_SHMEM``). The only difference is the
+``PAGEMAP_SCAN`` mask for cold page detection: use
+``!PAGE_IS_PRESENT`` instead of ``PAGE_IS_UFFD_DEACTIVATED``, since
+``UFFDIO_DEACTIVATE`` zaps PTEs on shmem (pages stay in page cache)
+rather than setting PROT_NONE.
+
QEMU/KVM
========

--
2.51.2