Re: 6.10/bisected/regression - commit 8430557fc584 cause warning at mm/page_table_check.c:198 __page_table_check_ptes_set+0x306

From: Peter Xu
Date: Wed May 22 2024 - 11:18:27 EST


On Wed, May 22, 2024 at 09:48:51AM +0200, David Hildenbrand wrote:
> On 22.05.24 00:36, Peter Xu wrote:
> > On Wed, May 22, 2024 at 03:21:04AM +0500, Mikhail Gavrilov wrote:
> > > On Wed, May 22, 2024 at 2:37 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > > Hmm I still cannot reproduce. Weird.
> > > >
> > > > Would it be possible for you to identify which line in debug_vm_pgtable.c
> > > > triggered that issue?
> > > >
> > > > I think it should be some set_pte_at() but I'm not sure, as there aren't a
> > > > lot and all of them look benign so far. It could be that I missed
> > > > something important.
> > >
> > > I hope this helps:
> >
> > Thanks for providing this; it's just that, for some reason, it doesn't
> > look consistent with what was reported.
> >
> > >
> > > > sh /usr/src/kernels/(uname -r)/scripts/faddr2line /lib/debug/lib/modules/(uname -r)/vmlinux debug_vm_pgtable+0x1c04
> > > debug_vm_pgtable+0x1c04/0x3360:
> > > native_ptep_get_and_clear at arch/x86/include/asm/pgtable_64.h:94
> > > (inlined by) ptep_get_and_clear at arch/x86/include/asm/pgtable.h:1262
> > > (inlined by) ptep_clear at include/linux/pgtable.h:509
> >
> > This is a pte_clear(), and pte_clear() shouldn't even do the set() checks,
> > and shouldn't stumble over what I added.
> >
> > IOW, it doesn't match the real stack dump reported previously:
> >
> > [ 5.581003] ? __page_table_check_ptes_set+0x306/0x3c0
> > [ 5.581274] ? __pfx___page_table_check_ptes_set+0x10/0x10
> > [ 5.581544] ? __pfx_check_pgprot+0x10/0x10
> > [ 5.581806] set_ptes.constprop.0+0x66/0xd0
> > [ 5.582072] ? __pfx_set_ptes.constprop.0+0x10/0x10
> > [ 5.582333] ? __pfx_pte_val+0x10/0x10
> > [ 5.582595] debug_vm_pgtable+0x1c04/0x3360
> >
>
> Staring at pte_clear_tests():
>
> #ifndef CONFIG_RISCV
> pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
> #endif
> set_pte_at(args->mm, args->vaddr, args->ptep, pte);
>
> So we set random PTE bits, probably setting the present, uffd and write bit
> at the same time. That doesn't make too much sense when we want to enforce
> that such combinations cannot exist.

The thing is, I don't think it should set the W bit at all, as we init
page_prot to be RWX but !shared:

args->page_prot = vm_get_page_prot(VM_ACCESS_FLAGS);

On x86_64 (Mikhail's system) page_prot should have the W bit cleared afaict,
and RANDOM_ORVALUE won't touch the W bit either, due to S390_SKIP_MASK (which
happens to contain bit W / bit 1, which is another "accident"..). So even
with the random bits applied it should not trigger.. I think that's also
why I cannot reproduce this problem locally.
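
To make the arithmetic above concrete, here is a small user-space sketch
(not kernel code). It assumes S390_SKIP_MASK covers bits 3:0 and
PPC64_SKIP_MASK covers bit 62 on 64-bit, as I recall them from
mm/debug_vm_pgtable.c, and the usual x86_64 positions for the present (bit
0), RW (bit 1) and uffd-wp (bit 10) bits. The helper names genmask64(),
random_orvalue() and would_warn() are made up for illustration, and
would_warn() only models the "present + uffd-wp + writable" combination
discussed in this thread:

```c
#include <stdint.h>

/* User-space stand-in for the kernel's GENMASK(h, l): bits l..h set. */
static uint64_t genmask64(unsigned int h, unsigned int l)
{
	return (~UINT64_C(0) >> (63 - h)) & (~UINT64_C(0) << l);
}

/*
 * Assumed layout of the skip masks (from memory, 64-bit build):
 * s390 skips the low 4 bits, ppc64 skips bit 62.
 */
static uint64_t random_orvalue(void)
{
	uint64_t s390_skip = genmask64(3, 0);
	uint64_t ppc64_skip = genmask64(62, 62);

	return genmask64(63, 0) & ~(s390_skip | ppc64_skip);
}

/* Assumed x86_64 PTE bit positions. */
#define X86_PTE_PRESENT	(UINT64_C(1) << 0)
#define X86_PTE_RW	(UINT64_C(1) << 1)
#define X86_PTE_UFFD_WP	(UINT64_C(1) << 10)

/* Toy model of the flagged combination: present + uffd-wp + writable. */
static int would_warn(uint64_t pteval)
{
	return (pteval & X86_PTE_PRESENT) &&
	       (pteval & X86_PTE_UFFD_WP) &&
	       (pteval & X86_PTE_RW);
}
```

Under those assumptions, ORing random_orvalue() into a pte whose low bits
are 0x25 (present, user, accessed, no W) sets the uffd-wp bit but leaves
bit 1 alone, so would_warn() stays false for that value.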

But I think applying random bits is indeed tricky, and I don't really know
why we did that. I can see that we want to set some non-empty pte, but
AFAIU this should be enough:

pte_t pte = pfn_pte(args->pte_pfn, args->page_prot);

That should already be pte_none()==false; then we clear and recheck to make
sure it is pte_none(), which looks good enough already. Evidently that trick
already broke PPC64 and S390 before, given the existence of PPC64_SKIP_MASK
etc..

I guess it won't hurt in this case to double check, though. Mikhail, would
you mind marking this line to see whether it's the one that triggered your
WARNING? Perhaps also dump a bit more than that, something like:

===8<===
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index f1c9a2c5abc0..610b1996b2e9 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -635,7 +635,8 @@ static void __init pte_clear_tests(struct pgtable_debug_args *args)
return;

#ifndef CONFIG_RISCV
- pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
+ pr_info("page_prot=0x%lx\n", pgprot_val(args->page_prot));
+ pr_info("pteval|RANDOM_ORVALUE=0x%lx\n", pte_val(pte) | RANDOM_ORVALUE);
#endif
set_pte_at(args->mm, args->vaddr, args->ptep, pte);
flush_dcache_page(page);
===8<===

For me it dumps:

[ 2.249478] debug_vm_pgtable: [pte_clear_tests ]: page_prot=0x25
[ 2.250049] debug_vm_pgtable: [pte_clear_tests ]: pteval|RANDOM_ORVALUE=0xbffffffffffffff5

Logically you should see the same, but since faddr2line doesn't seem to
work properly for some reason, it's worth a try.

>
> In pmd_clear_tests() and friends we use WRITE_ONCE() instead, so there we
> don't run into trouble.

Right, and I think they should probably use set_pmd_at() rather than
WRITE_ONCE() if we want to cover the helpers.. but that's another story.

Thanks,

--
Peter Xu