Re: [PATCH] mm/x86/pat: Only untrack the pfn range if unmap region

From: Peter Xu
Date: Mon Jul 15 2024 - 11:03:19 EST


On Sun, Jul 14, 2024 at 08:27:25PM +0200, David Hildenbrand wrote:
> On 14.07.24 12:59, David Wang wrote:
> >
> > At 2024-07-12 22:42:44, "Peter Xu" <peterx@xxxxxxxxxx> wrote:
> > > NOTE: I massaged the commit message compared to the RFC post [1]; the
> > > patch itself is untouched. I also removed the RFC tag and added more people
> > > to the loop. Please kindly help test this patch if you have a reproducer,
> > > as I can't reproduce it myself even with the syzbot reproducer on top of
> > > mm-unstable. Instead of further checking the reproducer, I decided to send
> > > this out first, as we have a bunch of reproducers on the list now.
> > > ---
> > > mm/memory.c | 5 ++---
> > > 1 file changed, 2 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 4bcd79619574..f57cc304b318 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1827,9 +1827,6 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> > >  	if (vma->vm_file)
> > >  		uprobe_munmap(vma, start, end);
> > >
> > > -	if (unlikely(vma->vm_flags & VM_PFNMAP))
> > > -		untrack_pfn(vma, 0, 0, mm_wr_locked);
> > > -
> > >  	if (start != end) {
> > >  		if (unlikely(is_vm_hugetlb_page(vma))) {
> > >  			/*
> > > @@ -1894,6 +1891,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> > >  		unsigned long start = start_addr;
> > >  		unsigned long end = end_addr;
> > >  		hugetlb_zap_begin(vma, &start, &end);
> > > +		if (unlikely(vma->vm_flags & VM_PFNMAP))
> > > +			untrack_pfn(vma, 0, 0, mm_wr_locked);
> > >  		unmap_single_vma(tlb, vma, start, end, &details,
> > >  				 mm_wr_locked);
> > >  		hugetlb_zap_end(vma, &details);
> > > --
> > > 2.45.0
> >
> > Hi,
> >
> > Today, I noticed a kernel warning with this patch.
> >
> >
> > [Sun Jul 14 16:51:38 2024] OOM killer enabled.
> > [Sun Jul 14 16:51:38 2024] Restarting tasks ... done.
> > [Sun Jul 14 16:51:38 2024] random: crng reseeded on system resumption
> > [Sun Jul 14 16:51:38 2024] PM: suspend exit
> > [Sun Jul 14 16:51:38 2024] ------------[ cut here ]------------
> > [Sun Jul 14 16:51:38 2024] WARNING: CPU: 1 PID: 2484 at arch/x86/mm/pat/memtype.c:1002 untrack_pfn+0x10c/0x120
>
> We fail to find what we need in the page tables, indicating that the page
> tables might have been modified / torn down in the meantime.
>
> Likely we have a previous call to unmap_single_vma() that modifies the page
> tables, and unmaps present PFNs.
>
> PAT is incompatible with that: it relies on information from the page tables
> to know what it has to undo during munmap(), or what it has to do during
> fork().
>
> The splat from the previous discussion [1]:
>
> follow_phys arch/x86/mm/pat/memtype.c:957 [inline]
> get_pat_info+0xf2/0x510 arch/x86/mm/pat/memtype.c:991
> untrack_pfn+0xf7/0x4d0 arch/x86/mm/pat/memtype.c:1104
> unmap_single_vma+0x1bd/0x2b0 mm/memory.c:1819
> zap_page_range_single+0x326/0x560 mm/memory.c:1920
> unmap_mapping_range_vma mm/memory.c:3684 [inline]
> unmap_mapping_range_tree mm/memory.c:3701 [inline]
> unmap_mapping_pages mm/memory.c:3767 [inline]
> unmap_mapping_range+0x1ee/0x280 mm/memory.c:3804
> truncate_pagecache+0x53/0x90 mm/truncate.c:731
> simple_setattr+0xf2/0x120 fs/libfs.c:886
> notify_change+0xec6/0x11f0 fs/attr.c:499
> do_truncate+0x15c/0x220 fs/open.c:65
> handle_truncate fs/namei.c:3308 [inline]
>
> indicates that file truncation seems to end up messing with a PFNMAP mapping
> that has PAT set. That is ... weird. I would have thought that PFNMAP would
> never really happen with file truncation.
>
> Does this only happen with an OOT driver that seems to do weird truncate
> stuff on files that have a PFNMAP mapping?
>
> [1]
> https://lore.kernel.org/all/3879ee72-84de-4d2a-93a8-c0b3dc3f0a4c@xxxxxxxxxx/

Ohhh.. I guess this will also stop working in VFIO, but I think it's fine
for now because, as Yan pointed out, VFIO PCI doesn't register those regions
yet, so VM_PAT is not set there so far.

And one thing I got wrong in my previous reply to Yan: memtype_check_insert()
can obviously work with more than one owner as long as the memtypes match,
and that's how fork() works, where VM_PAT needs to be duplicated. But this
whole thing is still a bit confusing to me, as it also means that on fork()
track_pfn_copy() will call memtype_kernel_map_sync one more time, even though
we're 100% sure the pgprot will be the same for the kernel mappings.
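
To spell out what I mean, the copy side roughly does the below today. This
is only a paraphrase from my memory of arch/x86/mm/pat/memtype.c (e.g. I
think reserve_pfn_range() is the helper that ends up doing both the interval
insert and the kernel map sync), so please read it as a sketch rather than
the exact code:

/* Rough paraphrase of the fork-time duplication path, not the literal code. */
int track_pfn_copy(struct vm_area_struct *vma)
{
	resource_size_t paddr;
	pgprot_t pgprot;

	if (vma->vm_flags & VM_PAT) {
		/* Re-derive the pfn and cachemode by walking the pgtable. */
		if (get_pat_info(vma, &paddr, &pgprot))
			return -EINVAL;
		/*
		 * Insert a second owner for the same memtype.  This path also
		 * runs memtype_kernel_map_sync once more, even though the
		 * kernel mapping can't have changed since the parent's
		 * reservation.
		 */
		return reserve_pfn_range(paddr, vma->vm_end - vma->vm_start,
					 &pgprot, 1);
	}
	return 0;
}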

I wonder whether there's some way the untrack_pfn framework could avoid
relying on the pgtable to fetch the pfn, because VFIO MMIO region protection
will also zap mappings like that in the near future, AFAICT. The pgprot part
should be easy to fetch: get_pat_info() could fall back to the vma's pgprot
if no mapping is found; the only real outlier should be CoW pages. The pfn
is the harder part so far, in that either track_pfn_copy() or untrack_pfn()
may need to know which pfn to untrack even when it only has the vma to work
with.
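
For example (purely a strawman: pat_untrack_pgprot() below does not exist,
and neither does the pte_prot/pte_present plumbing it assumes), the pgprot
side of such a fallback could look like:

/*
 * Strawman only, not existing kernel code: pick the pgprot to use for
 * untracking without requiring a present pgtable entry.
 */
static pgprot_t pat_untrack_pgprot(struct vm_area_struct *vma,
				   pgprot_t pte_prot, bool pte_present)
{
	/* If the pgtable still holds the mapping, trust it, as today. */
	if (pte_present)
		return pte_prot;

	/*
	 * Otherwise fall back to the vma's pgprot.  For a VM_PFNMAP vma the
	 * only place this can differ from what the pgtable used to contain
	 * should be CoW'ed pages, which were never part of the tracked
	 * range anyway.
	 */
	return vma->vm_page_prot;
}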

Thanks,

--
Peter Xu