[PATCH 1/3] mm/hmm: Add hmm_range_fault_unlockable() for mmap lock-drop support

From: Stanislav Kinsburskii

Date: Thu Apr 30 2026 - 21:20:40 EST

Add hmm_range_fault_unlockable(), a new HMM entry point that allows the
mmap read lock to be dropped during page faults. This follows the
int *locked pattern from get_user_pages_remote() in mm/gup.c: callers
pass an int *locked variable indicating they can handle the lock being
dropped.

When locked is non-NULL, hmm_vma_fault() adds FAULT_FLAG_ALLOW_RETRY
and FAULT_FLAG_KILLABLE to the fault flags passed to handle_mm_fault().
If the fault handler drops the mmap lock (returning VM_FAULT_RETRY or
VM_FAULT_COMPLETED), the function sets *locked = 0 and returns 0, or
-EINTR if a fatal signal is pending, matching GUP behavior. The caller
must then re-acquire the lock and restart from the beginning with a
fresh notifier sequence, since previously collected PFNs may be stale
after the lock was dropped.

The existing hmm_range_fault() is refactored into a thin wrapper that
calls hmm_range_fault_unlockable(range, NULL). Passing NULL means
FAULT_FLAG_ALLOW_RETRY is never set, preserving existing behavior for
all current callers with no functional change.

Faulting hugetlb pages is not supported on the unlockable path: if a
hugetlb page requires faulting, -EFAULT is returned. This is because
walk_hugetlb_range() holds hugetlb_vma_lock_read across the callback
and unconditionally unlocks on return; if the mmap lock is dropped
inside the callback the VMA may be freed, making the walk framework's
unlock a use-after-free. Hugetlb pages already present in page tables
are handled normally.

Documentation/mm/hmm.rst is updated with a new section describing the
unlockable API, its usage pattern, and the hugetlb limitation.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@xxxxxxxxxxxxxxxxxxx>
---
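Note for reviewers: the caller-side contract described above can be
exercised outside the kernel with a small userspace mock. The sketch
below is illustrative only and is not part of the patch. Every name in
it (struct mock_mm, mock_mmap_read_lock(), mock_range_fault_unlockable(),
mock_populate_range()) is a hypothetical stand-in, not a kernel API, and
the "fault" simply drops the mock lock a configurable number of times.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct mock_mm {
	int read_locked;	/* 1 while the mock mmap read lock is held */
	int drops_remaining;	/* faults that will still drop the lock */
	unsigned long seq;	/* mock notifier sequence counter */
};

static void mock_mmap_read_lock(struct mock_mm *mm)
{
	assert(!mm->read_locked);
	mm->read_locked = 1;
}

static void mock_mmap_read_unlock(struct mock_mm *mm)
{
	assert(mm->read_locked);
	mm->read_locked = 0;
}

/*
 * Models the contract from the commit message: with a non-NULL locked,
 * a fault may drop the lock, in which case *locked is set to 0 and 0 is
 * returned. With locked == NULL the lock is never dropped.
 */
static int mock_range_fault_unlockable(struct mock_mm *mm, int *locked)
{
	assert(mm->read_locked);
	if (locked && mm->drops_remaining > 0) {
		mm->drops_remaining--;
		mm->read_locked = 0;	/* the mock fault dropped the lock */
		*locked = 0;
		return 0;
	}
	return 0;	/* everything resolved, lock still held */
}

/* Caller loop; returns the number of attempts or a negative errno. */
static int mock_populate_range(struct mock_mm *mm)
{
	int attempts = 0;
	int locked, ret;

	for (;;) {
		attempts++;
		mm->seq++;	/* stands in for mmu_interval_read_begin() */
		locked = 1;
		mock_mmap_read_lock(mm);
		ret = mock_range_fault_unlockable(mm, &locked);
		if (locked)
			mock_mmap_read_unlock(mm);
		if (ret == -EBUSY)
			continue;
		if (ret)
			return ret;
		if (!locked)
			continue;	/* lock was dropped: restart the walk */
		return attempts;
	}
}
```

With drops_remaining set to 2, mock_populate_range() takes three
attempts and leaves the mock lock released. The point of the sketch is
that the caller unlocks only when locked is still 1, and treats a
return of 0 with locked == 0 as "restart", not as success.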
Documentation/mm/hmm.rst | 89 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/hmm.h | 1 +
mm/hmm.c | 91 +++++++++++++++++++++++++++++++++++++++++-----
3 files changed, 172 insertions(+), 9 deletions(-)

diff --git a/Documentation/mm/hmm.rst b/Documentation/mm/hmm.rst
index 7d61b7a8b65b7..13874b4dfd5f4 100644
--- a/Documentation/mm/hmm.rst
+++ b/Documentation/mm/hmm.rst
@@ -208,6 +208,95 @@ invalidate() callback. That lock must be held before calling
mmu_interval_read_retry() to avoid any race with a concurrent CPU page table
update.

+Scalable lock-drop support (hmm_range_fault_unlockable)
+=======================================================
+
+Some page fault handlers (e.g., userfaultfd) require the mmap lock to be
+dropped during fault resolution. Drivers that need to support such mappings
+can use::
+
+ int hmm_range_fault_unlockable(struct hmm_range *range, int *locked);
+
+This follows the same ``int *locked`` pattern used by ``get_user_pages_remote()``
+in ``mm/gup.c``. The caller sets ``*locked = 1`` and holds the mmap read lock
+before calling. If the lock is dropped during the fault (VM_FAULT_RETRY or
+VM_FAULT_COMPLETED), the function returns 0 with ``*locked = 0``, signalling
+the caller to restart its walk with a fresh notifier sequence. The caller is
+responsible for re-acquiring the lock and restarting from the beginning, since
+previously collected PFNs may be stale.
+
+The usage pattern is::
+
+ int driver_populate_range_unlockable(...)
+ {
+ struct hmm_range range;
+ int locked;
+ ...
+
+ range.notifier = &interval_sub;
+ range.start = ...;
+ range.end = ...;
+ range.hmm_pfns = ...;
+
+ if (!mmget_not_zero(interval_sub->notifier.mm))
+ return -EFAULT;
+
+ again:
+ range.notifier_seq = mmu_interval_read_begin(&interval_sub);
+ locked = 1;
+ mmap_read_lock(mm);
+ ret = hmm_range_fault_unlockable(&range, &locked);
+ if (locked)
+ mmap_read_unlock(mm);
+ if (ret) {
+ if (ret == -EBUSY)
+ goto again;
+ return ret;
+ }
+ if (!locked)
+ goto again;
+
+ take_lock(driver->update);
+ if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
+ release_lock(driver->update);
+ goto again;
+ }
+
+ /* Use pfns array content to update device page table,
+ * under the update lock */
+
+ release_lock(driver->update);
+ return 0;
+ }
+
+Passing ``locked = NULL`` to ``hmm_range_fault_unlockable()`` is equivalent to
+calling ``hmm_range_fault()``: the lock is never dropped.
+
+Note: hugetlb pages are not supported with the unlockable path. If a hugetlb
+page requires faulting during an ``hmm_range_fault_unlockable()`` call,
+``-EFAULT`` is returned. Hugetlb pages that are already present in page tables
+are handled normally.
+
+This limitation exists because ``walk_hugetlb_range()`` in the page walk
+framework holds ``hugetlb_vma_lock_read`` across the callback and unconditionally
+unlocks on return. If the mmap lock is dropped inside the callback (via
+VM_FAULT_RETRY), the VMA may be freed before the walk framework's unlock,
+resulting in a use-after-free. Possible approaches to lift this limitation in
+the future:
+
+1. Extend the walk framework to allow callbacks to signal that the hugetlb vma
+ lock was dropped (e.g., a flag in ``struct mm_walk`` that tells
+ ``walk_hugetlb_range()`` to skip the unlock).
+
+2. Bypass ``walk_page_range()`` for hugetlb pages in the unlockable path and
+ walk hugetlb page tables directly with custom lock management (similar to
+ how GUP handles hugetlb without the walk framework).
+
+3. Re-acquire the mmap lock before returning from the hugetlb callback (like
+ ``fixup_user_fault()``), ensuring the VMA remains valid for the walk
+ framework's unlock. This changes the "never re-take" contract and would
+ require callers to handle hugetlb differently.
+
Leverage default_flags and pfn_flags_mask
=========================================

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index db75ffc949a7a..46e581865c48a 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -123,6 +123,7 @@ struct hmm_range {
* Please see Documentation/mm/hmm.rst for how to use the range API.
*/
int hmm_range_fault(struct hmm_range *range);
+int hmm_range_fault_unlockable(struct hmm_range *range, int *locked);

/*
* HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
diff --git a/mm/hmm.c b/mm/hmm.c
index 5955f2f0c83db..9bf2fa37f2efd 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -33,6 +33,7 @@
struct hmm_vma_walk {
struct hmm_range *range;
unsigned long last;
+ int *locked;
};

enum {
@@ -86,10 +87,28 @@ static int hmm_vma_fault(unsigned long addr, unsigned long end,
fault_flags |= FAULT_FLAG_WRITE;
}

- for (; addr < end; addr += PAGE_SIZE)
- if (handle_mm_fault(vma, addr, fault_flags, NULL) &
- VM_FAULT_ERROR)
+ if (hmm_vma_walk->locked)
+ fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+
+ for (; addr < end; addr += PAGE_SIZE) {
+ vm_fault_t ret;
+
+ ret = handle_mm_fault(vma, addr, fault_flags, NULL);
+
+ if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
+ /*
+ * The mmap lock has been dropped by the fault handler.
+ * Record the failing address and signal lock-drop to
+ * the caller.
+ */
+ *hmm_vma_walk->locked = 0;
+ hmm_vma_walk->last = addr;
+ return -EAGAIN;
+ }
+
+ if (ret & VM_FAULT_ERROR)
return -EFAULT;
+ }
return -EBUSY;
}

@@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
if (required_fault) {
int ret;

+ /*
+ * Faulting hugetlb pages on the unlockable path is not
+ * supported. The walk framework holds hugetlb_vma_lock_read
+ * which must be dropped before handle_mm_fault, but if the
+ * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
+ * be freed and the walk framework's unconditional unlock
+ * becomes a use-after-free.
+ */
+ if (hmm_vma_walk->locked)
+ return -EFAULT;
+
spin_unlock(ptl);
hugetlb_vma_unlock_read(vma);
/*
@@ -655,14 +685,49 @@ static const struct mm_walk_ops hmm_walk_ops = {
*
* This is similar to get_user_pages(), except that it can read the page tables
* without mutating them (ie causing faults).
+ *
+ * The mmap lock must be held by the caller and will remain held on return.
+ * For a variant that allows the mmap lock to be dropped during faults (e.g.,
+ * for userfaultfd support), see hmm_range_fault_unlockable().
*/
int hmm_range_fault(struct hmm_range *range)
{
+ return hmm_range_fault_unlockable(range, NULL);
+}
+EXPORT_SYMBOL(hmm_range_fault);
+
+/**
+ * hmm_range_fault_unlockable - fault a range with mmap lock-drop support
+ * @range: argument structure
+ * @locked: pointer to lock state variable (input: 1; output: 0 if lock
+ * was dropped)
+ *
+ * Similar to hmm_range_fault() but allows the mmap lock to be dropped during
+ * page faults. This enables support for userfaultfd-backed mappings and other
+ * cases where handle_mm_fault() may need to release the mmap lock.
+ *
+ * The caller must hold the mmap read lock and set *locked = 1 before calling.
+ * On return:
+ * - *locked == 1: mmap lock is still held, return value has normal semantics
+ * - *locked == 0: mmap lock was dropped. The caller must re-acquire the lock
+ * and restart the walk. Returns 0, or -EINTR on a pending fatal signal.
+ *
+ * When the lock is dropped, this function does not re-acquire it and does
+ * not retry the fault itself. It returns 0 with *locked == 0, or -EINTR if
+ * a fatal signal is pending, and the caller restarts the walk with a fresh
+ * notifier sequence.
+ *
+ * Returns 0 on success or a negative error code. See hmm_range_fault() for
+ * the full list of possible errors.
+ */
+int hmm_range_fault_unlockable(struct hmm_range *range, int *locked)
+{
+ struct mm_struct *mm = range->notifier->mm;
struct hmm_vma_walk hmm_vma_walk = {
.range = range,
.last = range->start,
+ .locked = locked,
};
- struct mm_struct *mm = range->notifier->mm;
int ret;

mmap_assert_locked(mm);
@@ -674,16 +739,24 @@ int hmm_range_fault(struct hmm_range *range)
return -EBUSY;
ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
&hmm_walk_ops, &hmm_vma_walk);
+ if (ret == -EAGAIN) {
+ /*
+ * The mmap lock was dropped during the fault
+ * (e.g. userfaultfd). Signal the caller to restart
+ * by returning with *locked = 0.
+ */
+ if (fatal_signal_pending(current))
+ return -EINTR;
+ return 0;
+ }
/*
- * When -EBUSY is returned the loop restarts with
- * hmm_vma_walk.last set to an address that has not been stored
- * in pfns. All entries < last in the pfn array are set to their
- * output, and all >= are still at their input values.
+ * -EBUSY: page table changed during the walk.
+ * Restart from hmm_vma_walk.last.
*/
} while (ret == -EBUSY);
return ret;
}
-EXPORT_SYMBOL(hmm_range_fault);
+EXPORT_SYMBOL(hmm_range_fault_unlockable);

/**
* hmm_dma_map_alloc - Allocate HMM map structure