Re: [PATCH v4] mm: pgtable: free kernel page tables via RCU to fix ptdump UAF

From: Lorenzo Stoakes

Date: Mon Jun 15 2026 - 15:38:20 EST

TL;DR you're fixing what was an existing race because ptdump touches memory
it doesn't own, unsafely. 5ba2f0a15564 just made it more likely by
increasing the race window.

Your commit message should reflect this.

Also you should have ptdump take the RCU lock and assert the RCU lock in
walk_page_range_debug() and update comments to reflect this, rather than
taking the RCU lock there.

On Sat, Jun 13, 2026 at 08:35:47PM +0100, David Carlier wrote:
> ptdump walks the kernel page tables holding only the init_mm mmap lock

Can you be specific, i.e. mention that ptdump_walk_pgd() takes these locks?

> and the memory hotplug lock. Neither of those stops vmalloc or ioremap
> from freeing a kernel PTE page underneath the walk. When

I think saying 'or ioremap' is not really useful here. It's vmalloc.

> vmap_try_huge_pmd() installs a huge mapping it collapses the existing
> PTE table and frees it through pmd_free_pte_page(), and on x86 that
> happens without the init_mm mmap lock. syzbot caught the resulting

And on other arches we acquire the init_mm mmap_lock? I don't think we do,
do we?

> use after free in ptdump_pte_entry() reading a page table that had
> already been freed.

It's usually best practice to respond to the syzbot thread explaining your
reasoning, as anybody else looking at the report doesn't know that you've
provided a diagnosis, so it just seems like an open bug right now.

(If you're leaning heavily on Claude, I'd strongly suggest that you ensure
you have a good understanding of what's going on here without it).

>
> pagetable_free_kernel() used to free the page immediately on

Not used to, currently does. This patch is the change...

> configurations without CONFIG_ASYNC_KERNEL_PGTABLE_FREE, and on the
> async ones it only batched a TLB flush before freeing. In both cases a

'And on the async ones'? What does this mean? Can we be clearer here
please.

I think it's horribly confused really.

You need to identify the _actual_ problem.

It seems to me that what's going on here is that this bug _already
existed_ previously.

It's just that 5ba2f0a15564 made it (much?) _more likely_ by increasing the
race window.

The underlying problem is that ptdump is accessing ranges of memory it
shouldn't be, completely unprotected.

> lockless walker could still be dereferencing the page.

You say this, then only change the logic for the ptdump case...

I find it odd also that we take all the (previously assumed to be) required
locks in ptdump_walk_pgd() - the memory hotplug and mmap write lock, but
now you insert _another_ lock in walk_page_range_debug()?

So why are we taking some locks in the caller, and others in the function
that is called?

In fact the convention in pagewalk.c is that we _assert_ locks, not take
them.
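
To illustrate the assert-versus-take split, here is a minimal userspace
sketch. Everything in it (struct toy_lock, toy_walk_range(), toy_dump()) is
hypothetical and only models the pattern; it is not kernel code, and a real
walker would warn through lockdep rather than return an error.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lock; not a kernel API. */
struct toy_lock {
	int depth;	/* > 0 while the lock is held */
};

static void toy_lock_acquire(struct toy_lock *l)
{
	l->depth++;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->depth > 0);
	l->depth--;
}

static bool toy_lock_held(const struct toy_lock *l)
{
	return l->depth > 0;
}

/*
 * Callee: checks that its locking requirement is met but never takes
 * the lock itself. Returns -1 when called without the lock, otherwise
 * the number of non-zero entries visited.
 */
static int toy_walk_range(const struct toy_lock *l, const int *table, int n)
{
	int i, visited = 0;

	if (!toy_lock_held(l))
		return -1;

	for (i = 0; i < n; i++)
		if (table[i] != 0)
			visited++;

	return visited;
}

/* Caller: takes every lock the walk needs, all in one place. */
static int toy_dump(struct toy_lock *l, const int *table, int n)
{
	int ret;

	toy_lock_acquire(l);
	ret = toy_walk_range(l, table, n);
	toy_lock_release(l);

	return ret;
}
```

With this shape the locking requirements are visible at the call site, and a
caller that forgets a lock is caught by the callee's check instead of racing
silently.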

>
> Defer the free by a grace period instead. pagetable_free_kernel() now
> hands every kernel page table to call_rcu(), so the page stays valid
> until any walk that may have observed it has finished. The async path
> keeps doing its TLB flush first and then queues the RCU free per page.
>
> On the read side, walk_page_range_debug() takes the RCU read lock
> around the kernel walk through the new walk_kernel_page_table_range_rcu()
> helper. A walker either sees the cleared PMD and skips the page, or
> keeps it alive until it drops the lock. The plain
> walk_kernel_page_table_range() stays as it is for callers that already
> own their range and cannot race a free, such as the arm64 page table
> split paths.

You didn't actually explain as I requested what exactly 'owning a range'
means and why that protects them from this race.

You need to _spell out_ convincingly why only walk_page_range_debug() need
be changed.

In fact, you need to list the callers of all the functions that can walk
kernel page tables and explain, convincingly, why they are all safe and do
not require an RCU read lock.

- All the callers of walk_kernel_page_table_range()
https://elixir.bootlin.com/linux/v7.1/A/ident/walk_kernel_page_table_range
- All the callers of walk_kernel_page_table_range_lockless()
https://elixir.bootlin.com/linux/v7.1/C/ident/walk_kernel_page_table_range_lockless

I think the argument hinges on _ownership_.

Using Claude myself suggests so (see below), but I'd want a convincing
argument here.

┌───────────────────────────────────────────┬────────────────────────────────────┬──────────┐
│ Caller                                    │ Range it walks                     │ Exposed? │
├───────────────────────────────────────────┼────────────────────────────────────┼──────────┤
│ arm64 __change_memory_common (set_memory) │ caller-owned                       │ No       │
├───────────────────────────────────────────┼────────────────────────────────────┼──────────┤
│ arm64 range_split_to_ptes (block split)   │ caller-owned                       │ No       │
├───────────────────────────────────────────┼────────────────────────────────────┼──────────┤
│ riscv __set_memory                        │ caller-owned                       │ No       │
├───────────────────────────────────────────┼────────────────────────────────────┼──────────┤
│ loongarch set_memory                      │ caller-owned                       │ No       │
├───────────────────────────────────────────┼────────────────────────────────────┼──────────┤
│ openrisc DMA set-uncached                 │ caller-owned                       │ No       │
├───────────────────────────────────────────┼────────────────────────────────────┼──────────┤
│ hugetlb_vmemmap remap                     │ caller-owned (one folio's vmemmap) │ No       │
├───────────────────────────────────────────┼────────────────────────────────────┼──────────┤
│ ptdump (walk_page_range_debug)            │ the entire kernel address space    │ Yes      │
└───────────────────────────────────────────┴────────────────────────────────────┴──────────┘

>
> Fixes: 5ba2f0a15564 ("mm: introduce deferred freeing for kernel page tables")

It's not really fixing this commit honestly. But I guess it's a sensible
fiction, since it increased the race window.

> Reported-by: syzbot+fd95a72470f5a44e464c@xxxxxxxxxxxxxxxxxxxxxxxxx
> Closes: https://lore.kernel.org/all/6a287988.39669fcc.33b062.00a0.GAE@xxxxxxxxxx/T/
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: David Carlier <devnexen@xxxxxxxxx>
> ---
> v4: defer the free in both the async and non async configs, not just
> the async one. Move the walk under a named
> walk_kernel_page_table_range_rcu() helper instead of open coding
> rcu_read_lock() in walk_page_range_debug().
> v3: take rcu_read_lock() in the init_mm branch of
> walk_page_range_debug() rather than inside the lockless walker,
> which the arm64 split paths also use with GFP_PGTABLE_KERNEL and
> can sleep.
> v2: use call_rcu() instead of synchronize_rcu().
> ---
> include/linux/mm.h | 7 -------
> mm/pagewalk.c | 18 ++++++++++++++++--
> mm/pgtable-generic.c | 21 ++++++++++++++++++++-
> 3 files changed, 36 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 485df9c2dbdd..79408a17a1b0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3695,14 +3695,7 @@ static inline void __pagetable_free(struct ptdesc *pt)
> __free_pages(page, compound_order(page));
> }
>
> -#ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
> void pagetable_free_kernel(struct ptdesc *pt);
> -#else
> -static inline void pagetable_free_kernel(struct ptdesc *pt)
> -{
> - __pagetable_free(pt);
> -}
> -#endif
> /**
> * pagetable_free - Free pagetables
> * @pt: The page table descriptor
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 3ae2586ff45b..5b5807a88394 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -664,6 +664,19 @@ int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end
> return walk_pgd_range(start, end, &walk);
> }
>

No comments explaining anything?

> +static int walk_kernel_page_table_range_rcu(unsigned long start, unsigned long end,
> + const struct mm_walk_ops *ops, pgd_t *pgd,
> + void *private)

I don't understand your whitespace here, why is void *private on a random newline?

Also this seems to be far less code/comment than the previous version so
separation doesn't make sense here any more...

> +{
> + int err;
> +
> + rcu_read_lock();
> + err = walk_kernel_page_table_range(start, end, ops, pgd, private);
> + rcu_read_unlock();

As mentioned previously, let's just assert the RCU lock is held.

> +
> + return err;
> +}
> +
> /**
> * walk_page_range_debug - walk a range of pagetables not backed by a vma
> * @mm: mm_struct representing the target process of page table walk
> @@ -693,8 +706,9 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
>
> /* For convenience, we allow traversal of kernel mappings. */
> if (mm == &init_mm)
> - return walk_kernel_page_table_range(start, end, ops,
> - pgd, private);
> + return walk_kernel_page_table_range_rcu(start, end, ops, pgd,
> + private);
> +
> if (start >= end || !walk.mm)
> return -EINVAL;
> if (!check_ops_safe(ops))
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index b91b1a98029c..d45a556b4021 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -410,6 +410,13 @@ pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> goto again;
> }
>
> +static void kernel_pgtable_free_rcu(struct rcu_head *head)
> +{
> + struct ptdesc *pt = container_of(head, struct ptdesc, pt_rcu_head);
> +
> + __pagetable_free(pt);
> +}
> +
> #ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
> static void kernel_pgtable_work_func(struct work_struct *work);
>
> @@ -434,8 +441,15 @@ static void kernel_pgtable_work_func(struct work_struct *work)
> spin_unlock(&kernel_pgtable_work.lock);
>
> iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
> +
> + /*
> + * Lockless kernel page table walkers (ptdump, and any other user of
> + * walk_kernel_page_table_range_lockless()) dereference these pages
> + * under rcu_read_lock(). Free them after a grace period so a walker

Err what? walk_kernel_page_table_range_lockless() callers? Now you're
contradicting yourself...

Instead, say that _debug_ kernel page table walkers may be walking
non-owned ranges of memory, thus we must protect against concurrent page
table freeing.

> + * cannot still be reading a page we release.
> + */
> list_for_each_entry_safe(pt, next, &page_list, pt_list)
> - __pagetable_free(pt);
> + call_rcu(&pt->pt_rcu_head, kernel_pgtable_free_rcu);

I wonder if this would somehow be inefficient for a lot of page tables
being freed... but not sure there's an alternative here without somehow
pinning the list.

But probably it's ok.
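
For anybody following along, a toy single-threaded model of why deferring
the free closes the race is sketched here. All names (struct toy_domain,
toy_defer_free() and friends) are hypothetical, and this is not how RCU is
implemented: it simply holds queued pages until the reader count drops to
zero, which is more conservative than a real grace period.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model types; none of these are kernel APIs. */
struct toy_page {
	struct toy_page *next;	/* link on the pending-free list */
	int data;
};

struct toy_domain {
	int readers;			/* active read-side sections */
	struct toy_page *pending;	/* pages waiting to be freed */
	int nr_freed;			/* pages actually released */
};

/* Release every queued page; only called when no reader is active. */
static void toy_drain(struct toy_domain *d)
{
	while (d->pending) {
		struct toy_page *p = d->pending;

		d->pending = p->next;
		free(p);
		d->nr_freed++;
	}
}

static void toy_read_lock(struct toy_domain *d)
{
	d->readers++;
}

static void toy_read_unlock(struct toy_domain *d)
{
	assert(d->readers > 0);
	if (--d->readers == 0)
		toy_drain(d);
}

/* Queue one page; it is freed only once no reader can still see it. */
static void toy_defer_free(struct toy_domain *d, struct toy_page *p)
{
	p->next = d->pending;
	d->pending = p;
	if (d->readers == 0)
		toy_drain(d);
}
```

A reader that entered its read-side section before toy_defer_free() can keep
dereferencing the page until it calls toy_read_unlock(), and each page is
queued individually, which mirrors the per-page cost wondered about above.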

> }
>
> void pagetable_free_kernel(struct ptdesc *pt)
> @@ -446,4 +460,9 @@ void pagetable_free_kernel(struct ptdesc *pt)
>
> schedule_work(&kernel_pgtable_work.work);
> }
> +#else
> +void pagetable_free_kernel(struct ptdesc *pt)
> +{
> + call_rcu(&pt->pt_rcu_head, kernel_pgtable_free_rcu);

Please don't add code like this without explanation. Add a comment
explaining why you are doing this, or say 'refer to the comment in
kernel_pgtable_work_func()'.

> +}
> #endif
> --
> 2.53.0
>

I think doing this under RCU is the correct solution here, overall.

So with the comments addressed above for the freeing side, I think the
locking side should be something like:

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..b247c973e4d6 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -620,7 +620,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
* Note: Be careful to walk the kernel pages tables, the caller may be need to
* take other effective approaches (mmap lock may be insufficient) to prevent
* the intermediate kernel page tables belonging to the specified address range
- * from being freed (e.g. memory hot-remove).
+ * from being freed (e.g. memory hot-remove, vmap huge page promotion).
*/
int walk_kernel_page_table_range(unsigned long start, unsigned long end,
const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
@@ -643,7 +643,7 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
* Use this function to walk the kernel page tables locklessly. It should be
* guaranteed that the caller has exclusive access over the range they are
* operating on - that there should be no concurrent access, for example,
- * changing permissions for vmalloc objects.
+ * changing permissions for vmalloc objects, or vmap huge page promotion.
*/
int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end,
const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
@@ -677,6 +677,11 @@ int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end
* not backed by VMAs. Because 'unusual' entries may be walked this function
* will also not lock the PTEs for the pte_entry() callback.
*
+ * If traversing kernel mappings, the RCU lock must be held, since debug access
+ * to memory ranges implies the caller does not own these pages and thus the
+ * traversal might race with vmap huge page promotion which frees page tables
+ * under RCU.
+ *
* This is for debugging purposes ONLY.
*/
int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
@@ -691,10 +696,14 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
.no_vma = true
};

- /* For convenience, we allow traversal of kernel mappings. */
- if (mm == &init_mm)
+ /* For convenience, we allow traversal of kernel mappings under RCU. */
+ if (mm == &init_mm) {
+ RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
+ "no rcu read lock held");
+
return walk_kernel_page_table_range(start, end, ops,
pgd, private);
+ }
if (start >= end || !walk.mm)
return -EINVAL;
if (!check_ops_safe(ops))
diff --git a/mm/ptdump.c b/mm/ptdump.c
index 973020000096..50cd96a33dfd 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -178,11 +178,13 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)

get_online_mems();
mmap_write_lock(mm);
+ rcu_read_lock();
while (range->start != range->end) {
walk_page_range_debug(mm, range->start, range->end,
&ptdump_ops, pgd, st);
range++;
}
+ rcu_read_unlock();
mmap_write_unlock(mm);
put_online_mems();

Can you also then:

- Check that none of the architectures that implement
ptdump_state->effective_prot_pXX() and ptdump_state->note_page_pXX()
callbacks sleep or do anything that's RCU-unsafe.

- Document this in the commit message.

Thanks, Lorenzo