[PATCH 2/3] mm/pagewalk: let folio_walk_start() run under the per-VMA lock
From: Rik van Riel
Date: Wed Jun 24 2026 - 21:51:54 EST
folio_walk_start() asserts the mmap lock is held. For callers that only
need to read a single, already-present page, the mmap lock is a heavy and
often badly contended hammer. Such a caller can instead hold the per-VMA
lock, which keeps the VMA itself stable.
The per-VMA lock does not, however, keep the page tables walked below that
VMA from being freed. A concurrent munmap() or THP collapse of an
adjacent region in the same mm can free a shared upper-level table, and
THP collapse (collapse_huge_page() -> retract_page_tables()) frees page
tables of VMAs whose lock it does not hold. Page table freeing
synchronizes against lockless walkers with the mechanism gup_fast relies
on: tlb_remove_table_sync_one() IPIs the other CPUs and waits for each of
them to run the handler, which a CPU cannot do while its interrupts are
disabled. A table therefore cannot be freed while a walker that keeps
interrupts disabled across the walk may still be traversing it.
rcu_read_lock() is not sufficient because it does not hold off that IPI,
so the caller must keep interrupts disabled, not merely hold an RCU
read-side critical section.
Add an FW_VMA_LOCKED flag. When passed, folio_walk_start() asserts that
the per-VMA lock is held and that interrupts are disabled, instead of
asserting the mmap lock. It also warns (VM_WARN_ON_ONCE) if
CONFIG_MMU_GATHER_RCU_TABLE_FREE is not enabled or if the VMA is a
hugetlb VMA (PMD sharing maps page tables this VMA's lock does not
cover). The caller must keep interrupts disabled until folio_walk_end().
No existing caller passes FW_VMA_LOCKED, so behaviour is unchanged.
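For illustration only, the caller-side ordering and the new checks can be
modelled in userspace as below. This is a sketch, not kernel code:
struct model_ctx and the model_*() helpers are hypothetical stand-ins that
only track state, and the comments name the kernel step each line is
assumed to correspond to. Only the flag names come from this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values mirror the patch; everything named model_* is hypothetical. */
#define FW_ZEROPAGE	(1u << 0)
#define FW_VMA_LOCKED	(1u << 1)

/* Userspace stand-in for the state the kernel assertions inspect. */
struct model_ctx {
	bool mmap_locked;	/* mmap lock held in read mode */
	bool vma_locked;	/* per-VMA lock held */
	bool irqs_disabled;	/* local interrupts disabled */
	bool is_hugetlb;	/* VMA is a hugetlb mapping */
	bool rcu_table_free;	/* models CONFIG_MMU_GATHER_RCU_TABLE_FREE */
};

/* True when none of the checks at the top of folio_walk_start() would fire. */
static bool model_walk_preconditions_ok(const struct model_ctx *c,
					unsigned int flags)
{
	if (flags & FW_VMA_LOCKED)
		return c->rcu_table_free && !c->is_hugetlb &&
		       c->irqs_disabled && c->vma_locked;
	return c->mmap_locked;
}

/*
 * Models the caller-side ordering: take the VMA lock, disable interrupts,
 * walk, end the walk, and only then re-enable interrupts and unlock.
 * Returns whether the preconditions held at the point of the walk.
 */
static bool model_vma_locked_walk(struct model_ctx *c)
{
	bool ok;

	c->vma_locked = true;		/* kernel: take the per-VMA read lock */
	c->irqs_disabled = true;	/* kernel: local_irq_save() */
	ok = model_walk_preconditions_ok(c, FW_VMA_LOCKED);
	/* kernel: folio_walk_start() ... use the folio ... folio_walk_end() */
	c->irqs_disabled = false;	/* kernel: local_irq_restore() */
	c->vma_locked = false;		/* kernel: drop the per-VMA read lock */
	return ok;
}
```

The point of the model is the ordering: interrupts are re-enabled only
after the walk has ended, and holding the VMA lock alone (or the mmap lock
alone together with FW_VMA_LOCKED) does not satisfy the checks.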
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
---
include/linux/pagewalk.h | 7 +++++++
mm/pagewalk.c | 29 +++++++++++++++++++++++++++--
2 files changed, 34 insertions(+), 2 deletions(-)
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b41d7265c01b..d0387470d732 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -150,6 +150,13 @@ typedef int __bitwise folio_walk_flags_t;
/* Walk shared zeropages (small + huge) as well. */
#define FW_ZEROPAGE ((__force folio_walk_flags_t)BIT(0))
+/*
+ * The caller holds the per-VMA lock instead of the mmap lock, with interrupts
+ * disabled across the walk (until folio_walk_end()) to serialize against page
+ * table freeing, the same way gup_fast does. Only valid with RCU-freed page
+ * tables (CONFIG_MMU_GATHER_RCU_TABLE_FREE) and not for hugetlb.
+ */
+#define FW_VMA_LOCKED ((__force folio_walk_flags_t)BIT(1))
enum folio_walk_level {
FW_LEVEL_PTE,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..ab1e81983cb8 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -890,7 +890,10 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
* huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
* not correspond to the first physical entry of a logical hugetlb entry.
*
- * The mmap lock must be held in read mode.
+ * The mmap lock must be held in read mode. Alternatively, if @FW_VMA_LOCKED is
+ * passed, the VMA's per-VMA lock must be held and interrupts must be disabled
+ * across the walk and until folio_walk_end() (only supported with RCU-freed page
+ * tables, i.e. CONFIG_MMU_GATHER_RCU_TABLE_FREE, and not for hugetlb).
*
* Return: folio pointer on success, otherwise NULL.
*/
@@ -908,7 +911,29 @@ struct folio *folio_walk_start(struct folio_walk *fw,
pgd_t *pgdp;
p4d_t *p4dp;
- mmap_assert_locked(vma->vm_mm);
+ if (flags & FW_VMA_LOCKED) {
+ /*
+ * Lockless walk under the per-VMA lock instead of the mmap
+ * lock. The VMA lock keeps the VMA stable, but the page tables
+ * walked below it can still be freed concurrently: a munmap() or
+ * THP collapse of an adjacent region in the same mm can free a
+ * shared upper-level table, and collapse_huge_page() ->
+ * retract_page_tables() frees page tables of VMAs whose lock it
+ * does not hold. Page table freeing serializes against lockless
+ * walkers via tlb_remove_table_sync_one(), which IPIs and waits
+ * for every CPU to enable interrupts; an RCU read-side critical
+ * section does not block that IPI, so the caller must keep
+ * interrupts disabled across the whole walk, like gup_fast.
+ * Hugetlb (PMD sharing) maps page tables not covered by this
+ * VMA's lock and is not supported.
+ */
+ VM_WARN_ON_ONCE(!IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE));
+ VM_WARN_ON_ONCE(is_vm_hugetlb_page(vma));
+ lockdep_assert_irqs_disabled();
+ vma_assert_locked(vma);
+ } else {
+ mmap_assert_locked(vma->vm_mm);
+ }
vma_pgtable_walk_begin(vma);
if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
--
2.53.0-Meta