Re: [PATCH v4] mm/hugetlb: support FOLL_FORCE|FOLL_WRITE

From: David Hildenbrand
Date: Fri Dec 06 2024 - 11:04:19 EST


On 06.12.24 15:49, Guillaume Morin wrote:
Eric reported that PTRACE_POKETEXT fails when applications use hugetlb
for mapping text using huge pages. Before commit 1d8d14641fd9
("mm/hugetlb: support write-faults in shared mappings"), PTRACE_POKETEXT
worked by accident, but it was buggy and silently ended up mapping pages
writable into the page tables even though VM_WRITE was not set.

In general, FOLL_FORCE|FOLL_WRITE currently does not work with hugetlb.
Let's implement FOLL_FORCE|FOLL_WRITE properly for hugetlb, such that
what used to work by accident in the past now works properly, allowing
applications using hugetlb for text etc. to be properly debugged.

This change might also be required to implement uprobes support for
hugetlb [1].

[1] https://lore.kernel.org/lkml/ZiK50qob9yl5e0Xz@xxxxxxxxxxxxxxxxxx/
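(For context, a minimal and purely hypothetical reproducer of the
failure mode is a tracer poking an instruction word into a tracee's
hugetlb-backed text mapping. The mapping setup and names below are
illustrative only, not part of the patch:

#include <errno.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/*
 * Hypothetical sketch: the tracee's text is assumed to live in a
 * PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_HUGETLB mapping, and the
 * tracee is already ptrace-stopped.
 */
static int poke_text(pid_t pid, void *addr, long word)
{
	errno = 0;
	/* Without the fix, this write into hugetlb-backed text fails,
	 * since FOLL_FORCE|FOLL_WRITE is not implemented for hugetlb. */
	if (ptrace(PTRACE_POKETEXT, pid, addr, (void *)word) == -1 && errno)
		return -errno;
	return 0;
}
)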

Cc: Muchun Song <muchun.song@xxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Peter Xu <peterx@xxxxxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: Eric Hagberg <ehagberg@xxxxxxxxxxxxxx>
Signed-off-by: Guillaume Morin <guillaume@xxxxxxxxxxx>
---
Changes in v2:
- Improved commit message
Changes in v3:
- Fix potential uninitialized memory access in follow_huge_pud
- define pud_soft_dirty when soft dirty is not enabled
Changes in v4:
- Remove the soft dirty pud check
- Remove the pud_soft_dirty added in v3

mm/gup.c | 95 +++++++++++++++++++++++++---------------------------
mm/hugetlb.c | 20 ++++++-----
2 files changed, 57 insertions(+), 58 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 746070a1d8bf..63c705ff4162 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -587,6 +587,33 @@ static struct folio *try_grab_folio_fast(struct page *page, int refs,
}
#endif /* CONFIG_HAVE_GUP_FAST */
+/* Common code for can_follow_write_* */
+static inline bool can_follow_write_common(struct page *page,
+		struct vm_area_struct *vma, unsigned int flags)
+{
+	/* Maybe FOLL_FORCE is set to override it? */
+	if (!(flags & FOLL_FORCE))
+		return false;
+
+	/* But FOLL_FORCE has no effect on shared mappings */
+	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+		return false;
+
+	/* ... or read-only private ones */
+	if (!(vma->vm_flags & VM_MAYWRITE))
+		return false;
+
+	/* ... or already writable ones that just need to take a write fault */
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+
+	/*
+	 * See can_change_pte_writable(): we broke COW and could map the page
+	 * writable if we have an exclusive anonymous page ...
+	 */
+	return page && PageAnon(page) && PageAnonExclusive(page);
+}
+
static struct page *no_page_table(struct vm_area_struct *vma,
		unsigned int flags, unsigned long address)
{
@@ -613,6 +640,18 @@ static struct page *no_page_table(struct vm_area_struct *vma,
}
#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+/* FOLL_FORCE can write to even unwritable PUDs in COW mappings. */
+static inline bool can_follow_write_pud(pud_t pud, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
+{
+	/* If the pud is writable, we can write to the page. */
+	if (pud_write(pud))
+		return true;
+
+	return can_follow_write_common(page, vma, flags);
+}
+
static struct page *follow_huge_pud(struct vm_area_struct *vma,
				    unsigned long addr, pud_t *pudp,
				    int flags, struct follow_page_context *ctx)
@@ -625,13 +664,16 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma,
	assert_spin_locked(pud_lockptr(mm, pudp));

-	if ((flags & FOLL_WRITE) && !pud_write(pud))
+	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
+	page = pfn_to_page(pfn);
+
+	if ((flags & FOLL_WRITE) &&
+	    !can_follow_write_pud(pud, page, vma, flags))
		return NULL;

	if (!pud_present(pud))
		return NULL;
-	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;

That looks wrong. See follow_huge_pmd() for reference.

(1) You must not do a pfn_to_page() before we have verified that the
PUD is present.

(2) can_follow_write_pud() must be called with the first mapped page.
With hugetlb this is currently not strictly required, but anything else
is not future proof.



It must likely be something like:


	if (!pud_present(pud))
		return NULL;

	if ((flags & FOLL_WRITE) &&
	    !can_follow_write_pud(pud, pfn_to_page(pfn), vma, flags))
		return NULL;

	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
	page = pfn_to_page(pfn);
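
To spell out why, here is the same ordering annotated (sketch only;
declarations and the tail of the function as in the quoted hunk):

	assert_spin_locked(pud_lockptr(mm, pudp));

	/* (1) Only a present PUD has a pfn we may turn into a page. */
	if (!pud_present(pud))
		return NULL;

	/*
	 * (2) Pass the first mapped page -- i.e., the pfn before adding
	 * the sub-PUD offset -- so the PageAnonExclusive() check in
	 * can_follow_write_common() stays future proof.
	 */
	if ((flags & FOLL_WRITE) &&
	    !can_follow_write_pud(pud, pfn_to_page(pfn), vma, flags))
		return NULL;

	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
	page = pfn_to_page(pfn);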


	delayacct_wpcopy_end();
	return 0;
@@ -5943,7 +5944,8 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
	spin_lock(vmf->ptl);
	vmf->pte = hugetlb_walk(vma, vmf->address, huge_page_size(h));
	if (likely(vmf->pte && pte_same(huge_ptep_get(mm, vmf->address, vmf->pte), pte))) {
-		pte_t newpte = make_huge_pte(vma, &new_folio->page, !unshare);
+		const bool writable = !unshare && (vma->vm_flags & VM_WRITE);
+		pte_t newpte = make_huge_pte(vma, &new_folio->page, writable);

You probably missed my earlier comment: after the recent changes to make_huge_pte() that are already in mm/mm-unstable, this hunk can be dropped and the code left unchanged. make_huge_pte() will perform the VM_WRITE check itself.
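
If I recall the mm-unstable change correctly, the relevant logic is
roughly the following -- paraphrased from memory, so the parameter
naming and exact shape are assumptions, not a verbatim quote:

/* Paraphrased sketch of make_huge_pte() after the mm-unstable change. */
static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
			   bool try_mkwrite)
{
	pte_t entry;

	/* Only map writable if the VMA permits writes at all. */
	if (try_mkwrite && (vma->vm_flags & VM_WRITE))
		entry = huge_pte_mkwrite(huge_pte_mkdirty(
				mk_huge_pte(page, vma->vm_page_prot)));
	else
		entry = huge_pte_wrprotect(mk_huge_pte(page,
				vma->vm_page_prot));
	return entry;
}

With that in place, passing !unshare from hugetlb_wp() is already safe:
the VM_WRITE check happens inside make_huge_pte(), which is why the
hunk above becomes redundant.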

--
Cheers,

David / dhildenb