Re: [Question] CoW on VM_PFNMAP vma during write fault

From: David Hildenbrand
Date: Tue Feb 27 2024 - 08:00:25 EST


On 27.02.24 13:28, Wupeng Ma wrote:
We find that a warn will be produced during our test, the detail log is
shown in the end.

The core problem of this warn is that the first pfn of this pfnmap vma is
cleared during memory-failure. Digging into the source we find that this
problem can be triggered as following:

// mmap with MAP_PRIVATE and specific fd which hook mmap
mmap(MAP_PRIVATE, fd)
__mmap_region
remap_pfn_range
// set vma with pfnmap and the prot of pte is read only


Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.

What fd is that exactly? Often, we disallow private mappings in the mmap() callback (for a good reason).

// memset this memory with trigger fault
handle_mm_fault
__handle_mm_fault
handle_pte_fault
// write fault and !pte_write(entry)
do_wp_page
wp_page_copy // this will alloc a new page with valid page struct
// for this pfnmap vma

Here we replace the mapped PFNMAP thingy by a proper anon folio.


// inject a hwpoison to the first page of this vma

I assume this is an anon folio?

madvise_inject_error
memory_failure
hwpoison_user_mappings
try_to_unmap_one
// mark this pte as invalid (hwpoison)
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
address, range.end);

// during unmap vma, the first pfn of this pfnmap vma is invalid
vm_mmap_pgoff
do_mmap
__do_mmap_mm
__mmap_region
__do_munmap
unmap_region
unmap_vmas
unmap_single_vma
untrack_pfn
follow_phys // pte is already invalidate, WARN_ON here

unmap_single_vma()->...->zap_pte_range() should do the right thing when calling vm_normal_page().

untrack_pfn() is the problematic part.


CoW with a valid page for pfnmap vma is weird to us. Can we use
remap_pfn_range for private vma(read only)? Once CoW happens on a pfnmap
vma during write fault, this page is normal(page flag is valid) for most mm
subsystems, such as memory failure in thais case and extra should be done to
handle this special page.

During unmap, if this vma is pfnmap, unmap shouldn't be done since page
should not be touched for pfnmap vma.

But the root problem is that can we insert a valid page for pfnmap vma?

Any thoughts to solve this warn?

vm_normal_page() documentation explains how that magic is supposed to work. vm_normal_page() should be able to correctly identify whether we want to look at the struct page for an anon folio that was COWed.


untrack_pfn() indeed does not seem to be well prepared for handling MAP_PRIVATE mappings where we end up having anon folios.

I think it will already *completely mess up* simply when unmapping the range without the memory failure involved.

See, follow_phys() would get the PFN of the anon folio and then untrack_pfn() would do some nonesense with that. Completely broken.

The WARN is just a side-effect of the brokenness.

In follow_phys(), we'd likely have to call vm_normal_page(). If we get a page back, we'd likely have to fail follow_phys() instead of returning a PFN of an anon folio.

Now, how do we fix untrack_pfn() ? I really don't know. In theory, we might no longer have *any* PFNMAP PFN in there after COW'ing everything.

Sounds like MAP_PRIVATE VM_PFNMAP + __HAVE_PFNMAP_TRACKING is some broken garbage (sorry). Can we disallow it?

--
Cheers,

David / dhildenb