Re: [PATCH] mm/page_table_check: do not track special (PFN-mapped) PTEs

From: Thomas Weißschuh

Date: Fri Jun 12 2026 - 10:25:58 EST


On Mon, Jun 08, 2026 at 07:57:58PM +0400, Andrey Smirnov wrote:
> The vDSO data store ("[vvar]") special mapping is created as a VM_PFNMAP
> mapping and its pages are installed into userspace with vmf_insert_pfn(),
> which produces special PTEs (pte_special()). On x86 and arm64 (and riscv)
> pte_user_accessible_page() only tests the PRESENT/USER bits and does not
> exclude special PTEs, so page_table_check accounts these PFN mappings in
> the per-page anon/file map counters even though they are not rmap-managed
> pages (vm_normal_page() returns NULL for them).
>
> Most of these data pages live in the kernel image and are never freed, so
> the stray accounting is invisible. The time-namespace VVAR page is the
> exception: it is a real alloc_page() page that is released with
> __free_page() in free_time_ns() when the last task of a time namespace
> exits. Across the map / unmap / vdso_join_timens() zap transitions the
> special-PTE accounting is not balanced for this page, so a non-zero
> file_map_count survives to the free path and trips:
>
> kernel BUG at mm/page_table_check.c:143!
> __page_table_check_zero+0xfb/0x130
> __free_frozen_pages+0x52f/0x650
> free_time_ns+0x85/0xc0
> free_nsproxy+0x7f/0x130
> do_exit+0x313/0xa60
> do_group_exit+0x77/0x90
>
> This is reliably reproducible on x86_64 and arm64 under heavy container/CI
> churn that rapidly creates and destroys time namespaces (CLONE_NEWTIME via
> runc / docker-init / tini), and was independently reported by syzbot on
> riscv. It only manifests when CONFIG_PAGE_TABLE_CHECK is active.
>
> Special PTEs have no struct-page rmap semantics and must never have been
> tracked by page table check. Skip them in both the set and clear paths so
> the counters stay balanced (always zero) for PFN-mapped pages, regardless
> of how the architecture defines pte_user_accessible_page(). pte_special()
> is available generically (it is a no-op returning false on architectures
> without ARCH_HAS_PTE_SPECIAL), so this is a single, arch-independent fix.
>
> Note that the v7.0 generic vDSO datastore rework in commit 05988dba1179
> ("vdso/datastore: Allocate data pages dynamically") incidentally avoids
> the problem by switching the mapping to VM_MIXEDMAP + vmf_insert_page()
> with balanced struct-page accounting. This patch fixes the still-affected
> VM_PFNMAP path used by 6.18.y and earlier, and additionally makes
> page_table_check robust against any future PFN-mapped user pages.
>
> Fixes: df4e817b7108 ("mm: page table check")
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Thomas Weißschuh <thomas.weissschuh@xxxxxxxxxxxxx>
> Cc: Andrei Vagin <avagin@xxxxxxxxx>
> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> Cc: Vincenzo Frascino <vincenzo.frascino@xxxxxxx>
> Reported-by: syzbot+2b5fe617654be3d8848b@xxxxxxxxxxxxxxxxxxxxxxxxx
> Closes: https://github.com/siderolabs/talos/issues/13496
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Andrey Smirnov <andrey.smirnov@xxxxxxxxxxxxxx>
> ---
> mm/page_table_check.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_table_check.c b/mm/page_table_check.c
> index 4eeca782b888..ee492d5389b9 100644
> --- a/mm/page_table_check.c
> +++ b/mm/page_table_check.c
> @@ -150,9 +150,16 @@ void __page_table_check_pte_clear(struct mm_struct *mm, pte_t pte)
>  	if (&init_mm == mm)
>  		return;
>
> -	if (pte_user_accessible_page(pte)) {
> +	/*
> +	 * PFN-mapped (special) PTEs - e.g. the vDSO/time-namespace "[vvar]"
> +	 * mapping installed via vmf_insert_pfn() - are not rmap-managed and
> +	 * must not be tracked here. Tracking them can leave a non-zero map
> +	 * count on a struct page that is later freed (the time namespace VVAR
> +	 * page in free_time_ns()), tripping the BUG_ON() in
> +	 * __page_table_check_zero().

The [vvar] reference in this comment is already stale, so IMO the comment should
not mention it specifically. It is also not clear to me why this only happens
now and where the non-zero map count comes from.

> +	 */
> +	if (pte_user_accessible_page(pte) && !pte_special(pte))
>  		page_table_check_clear(pte_pfn(pte), PAGE_SIZE >> PAGE_SHIFT);
> -	}
>  }
>  EXPORT_SYMBOL(__page_table_check_pte_clear);
>
> @@ -205,7 +212,7 @@ void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
>
>  	for (i = 0; i < nr; i++)
>  		__page_table_check_pte_clear(mm, ptep_get(ptep + i));
> -	if (pte_user_accessible_page(pte))
> +	if (pte_user_accessible_page(pte) && !pte_special(pte))
>  		page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
>  }
>  EXPORT_SYMBOL(__page_table_check_ptes_set);
> --
> 2.53.0
>
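
For reference, the invariant the patch aims for can be sketched as a userspace
toy model. This is not kernel code: every toy_* name is hypothetical, the two
booleans merely stand in for pte_user_accessible_page() and pte_special(), and
the plain int stands in for the per-page map counter. It only illustrates that
applying the same filter on the set and the clear path keeps the counter at
zero for special PTEs. It does not explain where the reported imbalance comes
from.

```c
/* Userspace toy model; all toy_* names are hypothetical, not kernel API. */
#include <assert.h>
#include <stdbool.h>

struct toy_pte {
	bool user_accessible;	/* stands in for pte_user_accessible_page() */
	bool special;		/* stands in for pte_special() */
};

struct toy_page {
	int file_map_count;	/* stands in for the per-page map counter */
};

/* The predicate the patch applies on both the set and the clear path. */
static bool toy_tracked(struct toy_pte pte)
{
	return pte.user_accessible && !pte.special;
}

static void toy_pte_set(struct toy_page *page, struct toy_pte pte)
{
	if (toy_tracked(pte))
		page->file_map_count++;
}

static void toy_pte_clear(struct toy_page *page, struct toy_pte pte)
{
	if (toy_tracked(pte)) {
		/* A tracked clear without a matching set would underflow. */
		assert(page->file_map_count > 0);
		page->file_map_count--;
	}
}

/* Models the zero check done when the page is freed: true means clean. */
static bool toy_free_is_clean(const struct toy_page *page)
{
	return page->file_map_count == 0;
}
```

In this model a special PTE never touches the counter, so any sequence of
toy_pte_set() and toy_pte_clear() calls on it leaves toy_free_is_clean() true,
while a normal PTE still has to be cleared before the page can be freed.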