Re: [PATCH v3] x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()

From: David Hildenbrand
Date: Wed Apr 02 2025 - 08:21:08 EST



v2 -> v3:
* Make some !MMU configs happy by just moving the code into memtype.c

Obviously we need to make the bots happy once again, re the issue at [0]...

[0]: https://lore.kernel.org/all/9b3b3296-ab21-418b-a0ff-8f5248f9b4ec@lucifer.local/

Which by the way you... didn't seem to be cc'd on, unless I missed it? I
had to manually add you in which is... weird.


v1 -> v2:
* Avoid a second get_pat_info() [and thereby fix the error checking]
by passing the pfn from track_pfn_copy() to untrack_pfn_copy()
* Simplify untrack_pfn_copy() by calling untrack_pfn().
* Retested

Not sure if we want to CC stable ... it's really hard to trigger in
sane environments.

This kind of code path is probably in reality... theoretical. So I'm fine
with this.


Thanks a bunch for your review!


---
arch/x86/mm/pat/memtype.c | 52 +++++++++++++++++++++------------------
include/linux/pgtable.h | 28 ++++++++++++++++-----
kernel/fork.c | 4 +++
mm/memory.c | 11 +++------
4 files changed, 58 insertions(+), 37 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index feb8cc6a12bf2..d721cc19addbd 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -984,29 +984,42 @@ static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
return -EINVAL;
}

-/*
- * track_pfn_copy is called when vma that is covering the pfnmap gets
- * copied through copy_page_range().
- *
- * If the vma has a linear pfn mapping for the entire range, we get the prot
- * from pte and reserve the entire vma range with single reserve_pfn_range call.
- */
-int track_pfn_copy(struct vm_area_struct *vma)
+int track_pfn_copy(struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma, unsigned long *pfn)

I think we need an additional 'tracked' parameter so we know whether or not
this pfn is valid.

See below.


It's kind of icky... see the bot report for context, but we we sort of need
to differentiate between 'error' and 'nothing to do'. Of course PFN can
conceivably be 0 so we can't just return that or an error (plus return
values that can be both errors and values are fraught anyway).

So I guess -maybe- least horrid thing is:

int track_pfn_copy(struct vm_area_struct *dst_vma,
struct vm_area_struct *src_vma, unsigned long *pfn,
bool *pfn_tracked);

Then you can obviously invoke with track_pfn_copy(..., &pfn_tracked); and
do an if (pfn_tracked) untrack_pfn_copy(...).

I'm really not in favour of just initialising PFN to 0 because there are
code paths where this might actually get passed around and used
incorrectly.

But on the other hand - I get that this is disgusting.

I'm in favor of letting VM_PAT take care of that. Observe how untrack_pfn_copy() -> untrack_pfn() takes care of that by checking for VM_PAT.

So this should be working as expected? No need to add something on top that makes it even more ugly in the caller.



{
+ const unsigned long vma_size = src_vma->vm_end - src_vma->vm_start;
resource_size_t paddr;
- unsigned long vma_size = vma->vm_end - vma->vm_start;
pgprot_t pgprot;
+ int rc;

- if (vma->vm_flags & VM_PAT) {
- if (get_pat_info(vma, &paddr, &pgprot))
- return -EINVAL;
- /* reserve the whole chunk covered by vma. */
- return reserve_pfn_range(paddr, vma_size, &pgprot, 1);
- }
+ if (!(src_vma->vm_flags & VM_PAT))
+ return 0;

I do always like the use of the guard clause pattern :)

But here we have a case where now error = 0, pfn not set, and we will try
to untrack it despite !VM_PAT.

Right, and untrack_pfn() is smart enough to filter that out. (just like for any other invokation)


+
+ /*
+ * Duplicate the PAT information for the dst VMA based on the src
+ * VMA.
+ */
+ if (get_pat_info(src_vma, &paddr, &pgprot))
+ return -EINVAL;
+ rc = reserve_pfn_range(paddr, vma_size, &pgprot, 1);
+ if (rc)
+ return rc;

I mean it's a crazy nit, but we use ret elsewhere but rc here, maybe better
to use ret in both places.

But also feel free to ignore this.

"int retval;" ? ;)



+ /* Reservation for the destination VMA succeeded. */
+ vm_flags_set(dst_vma, VM_PAT);
+ *pfn = PHYS_PFN(paddr);
return 0;
}

+void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn)
+{
+ untrack_pfn(dst_vma, pfn, dst_vma->vm_end - dst_vma->vm_start, true);
+ /*
+ * Reservation was freed, any copied page tables will get cleaned
+ * up later, but without getting PAT involved again.
+ */
+}
+
/*
* prot is passed in as a parameter for the new mapping. If the vma has
* a linear pfn mapping for the entire range, or no vma is provided,
@@ -1095,15 +1108,6 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
}
}

-/*
- * untrack_pfn_clear is called if the following situation fits:
- *
- * 1) while mremapping a pfnmap for a new region, with the old vma after
- * its pfnmap page table has been removed. The new vma has a new pfnmap
- * to the same pfn & cache type with VM_PAT set.
- * 2) while duplicating vm area, the new vma fails to copy the pgtable from
- * old vma.
- */

> This just wrong now?

Note that I'm keeping the doc to a single place -- the stub in the header. (below)

Or can you elaborate what exactly is "wrong"?


void untrack_pfn_clear(struct vm_area_struct *vma)
{
vm_flags_clear(vma, VM_PAT);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 94d267d02372e..4c107e17c547e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1508,14 +1508,25 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
}

/*
- * track_pfn_copy is called when vma that is covering the pfnmap gets
- * copied through copy_page_range().
+ * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page
+ * tables copied during copy_page_range(). On success, stores the pfn to be
+ * passed to untrack_pfn_copy().
*/
-static inline int track_pfn_copy(struct vm_area_struct *vma)
+static inline int track_pfn_copy(struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma, unsigned long *pfn)
{
return 0;
}

+/*
+ * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during
+ * copy_page_range(), but after track_pfn_copy() was already called.
+ */

Do we really care to put a comment like this on a stub function?

Whoever started this beautiful VM_PAT code decided to do it like that: and I think it's better kept at a single location.


+static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+ unsigned long pfn)
+{
+}
+
/*
* untrack_pfn is called while unmapping a pfnmap for a region.
* untrack can be called for a specific region indicated by pfn and size or
@@ -1528,8 +1539,10 @@ static inline void untrack_pfn(struct vm_area_struct *vma,
}

/*
- * untrack_pfn_clear is called while mremapping a pfnmap for a new region
- * or fails to copy pgtable during duplicate vm area.
+ * untrack_pfn_clear is called in the following cases on a VM_PFNMAP VMA:
+ *
+ * 1) During mremap() on the src VMA after the page tables were moved.
+ * 2) During fork() on the dst VMA, immediately after duplicating the src VMA.
*/

Can I say as an aside that I hate this kind of hook? Like quite a lot?

I mean I've been looking at mremap() of anon mappings as you know obv. but
the thought of PFN mapping mremap()ing is kind of also a bit ugh.

I absolutely hate all of that, but I'll have to leave any cleanups to people with more spare time ;)


static inline void untrack_pfn_clear(struct vm_area_struct *vma)
{
@@ -1540,7 +1553,10 @@ extern int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
unsigned long size);
extern void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
pfn_t pfn);
-extern int track_pfn_copy(struct vm_area_struct *vma);
+extern int track_pfn_copy(struct vm_area_struct *dst_vma,
+ struct vm_area_struct *src_vma, unsigned long *pfn);
+extern void untrack_pfn_copy(struct vm_area_struct *dst_vma,
+ unsigned long pfn);
extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
unsigned long size, bool mm_wr_locked);
extern void untrack_pfn_clear(struct vm_area_struct *vma);
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f32..ca2ca3884f763 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -504,6 +504,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
vma_numab_state_init(new);
dup_anon_vma_name(orig, new);

+ /* track_pfn_copy() will later take care of copying internal state. */
+ if (unlikely(new->vm_flags & VM_PFNMAP))
+ untrack_pfn_clear(new);

OK so maybe I'm being stupid here, but - is it the case that

a. We duplicate a VMA that has a PAT-tracked PFN map
> b. We must clear any existing tracking so everything is 'reset' to zero> c. track_pfn_copy() will later in fork process set anything up we need here.

Is this correct?

Right. But b) is actually not "clearing any tracking" (because there is no tracking/reservation for the copied version yet) but marking it as "not tracked".


+
return new;
}

diff --git a/mm/memory.c b/mm/memory.c
index fb7b8dc751679..dc8efa1358e94 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1362,12 +1362,12 @@ int
copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
{
pgd_t *src_pgd, *dst_pgd;
- unsigned long next;
unsigned long addr = src_vma->vm_start;
unsigned long end = src_vma->vm_end;
struct mm_struct *dst_mm = dst_vma->vm_mm;
struct mm_struct *src_mm = src_vma->vm_mm;
struct mmu_notifier_range range;
+ unsigned long next, pfn;
bool is_cow;
int ret;

@@ -1378,11 +1378,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
return copy_hugetlb_page_range(dst_mm, src_mm, dst_vma, src_vma);

if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
- /*
- * We do not free on error cases below as remove_vma
- * gets called on error from higher level routine
- */
- ret = track_pfn_copy(src_vma);
+ ret = track_pfn_copy(dst_vma, src_vma, &pfn);
if (ret)
return ret;
}
@@ -1419,7 +1415,6 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
continue;
if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
addr, next))) {
- untrack_pfn_clear(dst_vma);
ret = -ENOMEM;
break;
}
@@ -1429,6 +1424,8 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
raw_write_seqcount_end(&src_mm->write_protect_seq);
mmu_notifier_invalidate_range_end(&range);
}
+ if (ret && unlikely(src_vma->vm_flags & VM_PFNMAP))
+ untrack_pfn_copy(dst_vma, pfn);

Yeah, the problem here is that !(src_vma->vm_flags & VM_PFNMAP) is not the
_only_ way we can not have a valid pfn.

Do we still want to untrack_pfn_copy() even if !VM_PAT?

Sure, let that be handled internally, where all the ugly VM_PAT handling resides.

Unless there is very good reason to do it differently.

Thanks!

--
Cheers,

David / dhildenb