Re: [PATCH v2] mm: pgtable: protect lockless kernel page table walks with RCU
From: David CARLIER
Date: Fri Jun 12 2026 - 13:21:46 EST
On Fri, 12 Jun 2026 at 17:12, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, 12 Jun 2026 06:05:40 +0100 David Carlier <devnexen@xxxxxxxxx> wrote:
>
> > ptdump walks the kernel page tables locklessly through
> > walk_kernel_page_table_range_lockless(). It only holds the init_mm
> > mmap lock and the memory hotplug lock, and neither excludes
> > vmalloc/ioremap teardown from freeing kernel PTE pages via
> > pmd_free_pte_page() -> pagetable_free_kernel(). syzbot hit a
> > use-after-free in ptdump_pte_entry() reading a PTE page that was freed
> > underneath the walk.
> >
> > Deferring the kernel page table free only batches the TLB flush; it does
> > not wait for lockless walkers. Mirror the user page table walk, where
> > pte_offset_map() already takes the RCU read lock: hold rcu_read_lock()
> > across the lockless kernel walk and rcu-free the page tables in the
> > kernel page table free worker, after the batched TLB flush. A walker
> > then either observes the cleared PMD and skips the page, or keeps it
> > alive until it drops the RCU read lock.
> >
> > ...
> >
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -655,13 +655,26 @@ int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end
> > .private = private,
> > .no_vma = true
> > };
> > + int err;
> >
> > if (start >= end)
> > return -EINVAL;
> > if (!check_ops_safe(ops))
> > return -EINVAL;
> >
> > - return walk_pgd_range(start, end, &walk);
> > + /*
> > + * Kernel intermediate page tables can be freed concurrently by
> > + * vmalloc/ioremap teardown (e.g. pmd_free_pte_page()), which routes
> > + * the freed pages through pagetable_free_kernel(). That path defers
> > + * the free past an RCU grace period, so hold the RCU read lock across
> > + * the lockless walk to prevent a page table from being freed while we
> > + * are still dereferencing it.
> > + */
> > + rcu_read_lock();
> > + err = walk_pgd_range(start, end, &walk);
> > + rcu_read_unlock();
> > +
> > + return err;
> > }
>
> Adding a lock to a function which is advertised to "walk the kernel
> page tables locklessly" is a bit of a head-spinner.
>
> Sashiko claims that some callback functions can perform sleeping
> allocations:
>
> https://sashiko.dev/#/patchset/20260612050540.31594-1-devnexen@xxxxxxxxx
>
>
Sashiko's right, and it's the same issue you flagged about the name.
arm64's range_split_to_ptes() also goes through
walk_kernel_page_table_range_lockless() and passes GFP_PGTABLE_KERNEL
into split_pmd()/split_pud(), which can sleep, so rcu_read_lock()
inside the lockless helper is wrong on both counts.

v3 leaves that helper lockless and takes rcu_read_lock() only in the
init_mm branch of walk_page_range_debug(), whose sole caller is ptdump.
ptdump's callbacks don't sleep. The arm64 splitters are untouched and
keep relying on their existing exclusive access guarantee.

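To make the intended split concrete, here is a rough userspace model of it, not the v3 diff. Everything in it is a stand-in: rcu_read_lock()/rcu_read_unlock() are stubbed as a nesting counter, and walk_lockless(), walk_debug(), dump_cb() and split_cb() are hypothetical names modelling walk_kernel_page_table_range_lockless(), walk_page_range_debug(), a ptdump callback and an arm64 splitter callback. It also assumes the init_mm branch of the debug walker simply calls the lockless helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the RCU read-side primitives (not the kernel API). */
static int rcu_depth;                       /* models read-side nesting depth */
static void rcu_read_lock(void)   { rcu_depth++; }
static void rcu_read_unlock(void) { rcu_depth--; }

typedef int (*walk_cb_t)(void *private);

/* Models walk_kernel_page_table_range_lockless(): takes no lock at all. */
static int walk_lockless(walk_cb_t cb, void *private)
{
	return cb(private);
}

/*
 * Models walk_page_range_debug(): only the init_mm branch wraps the
 * lockless helper in an RCU read-side section.
 */
static int walk_debug(bool is_init_mm, walk_cb_t cb, void *private)
{
	int err;

	if (is_init_mm) {
		rcu_read_lock();
		err = walk_lockless(cb, private);
		rcu_read_unlock();
		return err;
	}

	/* The user mm branch is out of scope for this sketch. */
	return cb(private);
}

/* ptdump-like callback: never sleeps, just records the RCU depth it saw. */
static int dump_cb(void *private)
{
	*(int *)private = rcu_depth;
	return 0;
}

/*
 * Splitter-like callback: may sleep (allocation), so it reports an error
 * if it is ever invoked inside an RCU read-side section.
 */
static int split_cb(void *private)
{
	(void)private;
	return rcu_depth ? -1 : 0;
}
```

In this model dump_cb() runs under the read lock when it comes in through walk_debug(true, ...), while split_cb() called through walk_lockless() never sees it. Routing split_cb() through the init_mm branch of walk_debug() fails, which is the v2 problem in miniature.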
Cheers.