Re: [PATCH v2] mm: pgtable: protect lockless kernel page table walks with RCU
From: Andrew Morton
Date: Fri Jun 12 2026 - 12:18:16 EST
On Fri, 12 Jun 2026 06:05:40 +0100 David Carlier <devnexen@xxxxxxxxx> wrote:
> ptdump walks the kernel page tables locklessly through
> walk_kernel_page_table_range_lockless(). It only holds the init_mm
> mmap lock and the memory hotplug lock, and neither excludes
> vmalloc/ioremap teardown from freeing kernel PTE pages via
> pmd_free_pte_page() -> pagetable_free_kernel(). syzbot hit a
> use-after-free in ptdump_pte_entry() reading a PTE page that was freed
> underneath the walk.
>
> Deferring the kernel page table free only batches the TLB flush; it does
> not wait for lockless walkers. Mirror the user page table walk, where
> pte_offset_map() already takes the RCU read lock: hold rcu_read_lock()
> across the lockless kernel walk and rcu-free the page tables in the
> kernel page table free worker, after the batched TLB flush. A walker
> then either observes the cleared PMD and skips the page, or keeps it
> alive until it drops the RCU read lock.
>
> ...
>
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -655,13 +655,26 @@ int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end
>  		.private = private,
>  		.no_vma = true
>  	};
> +	int err;
>
>  	if (start >= end)
>  		return -EINVAL;
>  	if (!check_ops_safe(ops))
>  		return -EINVAL;
>
> -	return walk_pgd_range(start, end, &walk);
> +	/*
> +	 * Kernel intermediate page tables can be freed concurrently by
> +	 * vmalloc/ioremap teardown (e.g. pmd_free_pte_page()), which routes
> +	 * the freed pages through pagetable_free_kernel(). That path defers
> +	 * the free past an RCU grace period, so hold the RCU read lock across
> +	 * the lockless walk to prevent a page table from being freed while we
> +	 * are still dereferencing it.
> +	 */
> +	rcu_read_lock();
> +	err = walk_pgd_range(start, end, &walk);
> +	rcu_read_unlock();
> +
> +	return err;
>  }
Adding a lock to a function which is advertised to "walk the kernel
page tables locklessly" is a bit of a head-spinner.
Sashiko claims that some callback functions can perform sleeping
allocations:
https://sashiko.dev/#/patchset/20260612050540.31594-1-devnexen@xxxxxxxxx
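
To make the concern concrete, here is a rough userspace toy model, not
kernel code. Every toy_* name is hypothetical and stands in for the real
primitive only loosely: toy_rcu_read_lock()/toy_rcu_read_unlock() track
read-side nesting, and toy_might_sleep() plays the role of a
might_sleep()-style check. It assumes the general rule that a walker must
not block inside an RCU read-side critical section, and shows that any
callback which may sleep trips the check once the whole walk is wrapped in
the read lock.

```c
#include <assert.h>

/* Toy userspace model, NOT kernel code. All toy_* names are hypothetical. */

static int toy_rcu_depth;        /* nesting depth of the modelled read-side section */
static int toy_sleep_violations; /* times a "may sleep" point was hit under the lock */

static void toy_rcu_read_lock(void)
{
	toy_rcu_depth++;
}

static void toy_rcu_read_unlock(void)
{
	assert(toy_rcu_depth > 0);
	toy_rcu_depth--;
}

/* Stand-in for a might_sleep()-style check: count hits inside a read-side section. */
static void toy_might_sleep(void)
{
	if (toy_rcu_depth > 0)
		toy_sleep_violations++;
}

typedef int (*toy_entry_cb)(unsigned long addr);

/* A callback that only inspects the entry and never blocks. */
static int toy_cb_nonblocking(unsigned long addr)
{
	(void)addr;
	return 0;
}

/* A callback that models a sleeping allocation on every entry. */
static int toy_cb_sleeping_alloc(unsigned long addr)
{
	(void)addr;
	toy_might_sleep();
	return 0;
}

/* Models the patched walker: the whole walk runs under the read lock. */
static int toy_walk_range(unsigned long start, unsigned long end, toy_entry_cb cb)
{
	int err = 0;

	if (start >= end)
		return -1;

	toy_rcu_read_lock();
	for (unsigned long addr = start; addr < end && !err; addr++)
		err = cb(addr);
	toy_rcu_read_unlock();

	return err;
}
```

In this model, walking a range with toy_cb_nonblocking leaves
toy_sleep_violations untouched, while toy_cb_sleeping_alloc bumps it once
per visited entry. Whether any real callback actually behaves like the
second one is exactly what would need checking.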