Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

From: Guillaume Morin
Date: Tue Apr 30 2024 - 14:58:49 EST


On 30 Apr 20:21, David Hildenbrand wrote:
> Sorry for not replying earlier, was busy with other stuff. I'll try getting
> that stuff into shape and send it out soonish.

No worries. Let me know what you think of the FOLL_FORCE patch when you
have a sec.

> > I went with using one write uprobe function with some additional
> > branches. I went back and forth between that and making them 2 different
> > functions.
>
> All the folio_test_hugetlb() special casing is a bit suboptimal. Likely we
> want a separate variant, because we should be using hugetlb PTE functions
> consistently (e.g., huge_pte_uffd_wp() vs pte_uffd_wp(), softdirty does not
> exist etc.)

Ok, I'll give this a whirl and send something probably tomorrow.
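
Roughly the shape I have in mind for the separate variant: a hugetlb-only
helper that sticks to the huge_pte_* helpers and drops the
folio_test_hugetlb() branching. Just a sketch based on the diff below; the
exact signature and the elided checks are still to be worked out:

	static int __write_opcode_hugetlb(pte_t *ptep, unsigned long hmask,
					  unsigned long vaddr,
					  unsigned long next, struct mm_walk *walk)
	{
		struct uwo_data *data = walk->private;
		struct vm_area_struct *vma = walk->vma;
		pte_t pte = huge_ptep_get(ptep);

		/* ... the same folio/page lookups and checks as the PTE
		 * variant, but using the hugetlb helpers (huge_pte_uffd_wp()
		 * etc.) consistently ... */

		/* Unmap + flush the TLB, so that we can write atomically. */
		flush_cache_page(vma, vaddr, pte_pfn(pte));
		pte = huge_ptep_clear_flush(vma, vaddr, ptep);

		/* ... copy data->opcode into the PAGE_SIZE subpage ... */

		/* We modified the page. Make sure to mark the PTE dirty. */
		set_huge_pte_at(walk->mm, vaddr, ptep, huge_pte_mkdirty(pte),
				(~hmask) + 1);
		return UWO_DONE;
	}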

> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index 2f4e88552d3f..8a33e380f7ea 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -83,6 +83,10 @@ static const struct fs_parameter_spec hugetlb_fs_parameters[] = {
> > {}
> > };
> > +bool hugetlbfs_mapping(struct address_space *mapping) {
> > + return mapping->a_ops == &hugetlbfs_aops;
>
> is_vm_hugetlb_page() might be what you are looking for.

I use hugetlbfs_mapping() in __uprobe_register(), which does an early
return based only on the mapping when hugetlb is not supported. There is
no vma that I can get at that point (afaict).

I could refactor so we do the check once we have a vma, but it looked
cleaner to introduce the helper since there is already shmem_mapping().
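
For context, the early return is roughly the following (a sketch; the
hugetlb_uprobes_supported() name is made up here just to illustrate why a
mapping-based test is convenient at that point):

	/* In __uprobe_register() we only have the inode, no vma yet, so
	 * the check has to be mapping-based.
	 * hugetlb_uprobes_supported() is a placeholder for whatever
	 * condition the patch ends up requiring. */
	if (hugetlbfs_mapping(inode->i_mapping) &&
	    !hugetlb_uprobes_supported())
		return -EOPNOTSUPP;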

> > }
> > -static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
> > +static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len, unsigned long page_mask)
> > {
> > void *kaddr = kmap_atomic(page);
> > - memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
> > + memcpy(dst, kaddr + (vaddr & ~page_mask), len);
> > kunmap_atomic(kaddr);
> > }
>
> > -static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)
> > +static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len, unsigned long page_mask)
> > {
> > void *kaddr = kmap_atomic(page);
> > - memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
> > + memcpy(kaddr + (vaddr & ~page_mask), src, len);
> > kunmap_atomic(kaddr);
> > }
>
> These two changes really are rather ugly ...
>
> And why are they even required? We get a PAGE_SIZE-sized subpage of a
> hugetlb page. We only kmap that one and copy within that one.
>
> In other words, I don't think the copy_from_page() and copy_to_page()
> changes are even required when we consistently work on subpages and not
> suddenly on head pages.

The main reason is that the previous __replace_page() worked directly on the
full hugetlb page, so adjusting after gup seemed to make more sense to me. But
now I guess it's not that useful (especially since we're going with a different
version of write_uprobe). I'll fix it.
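
Agreed that once we consistently pass the PAGE_SIZE subpage that covers the
opcode, the existing helpers can stay exactly as they are, e.g.:

	/* Unchanged pre-patch helper: works as long as "page" is the
	 * PAGE_SIZE subpage containing vaddr, hugetlb or not. */
	static void copy_to_page(struct page *page, unsigned long vaddr,
				 const void *src, int len)
	{
		void *kaddr = kmap_atomic(page);

		memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
		kunmap_atomic(kaddr);
	}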

(...)
> > {
> > struct uwo_data *data = walk->private;
> > const bool is_register = !!is_swbp_insn(&data->opcode);
> > @@ -415,9 +417,12 @@ static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr,
> > /* Unmap + flush the TLB, such that we can write atomically .*/
> > flush_cache_page(vma, vaddr, pte_pfn(pte));
> > - pte = ptep_clear_flush(vma, vaddr, ptep);
> > + if (folio_test_hugetlb(folio))
> > + pte = huge_ptep_clear_flush(vma, vaddr, ptep);
> > + else
> > + pte = ptep_clear_flush(vma, vaddr, ptep);
> > copy_to_page(page, data->opcode_vaddr, &data->opcode,
> > - UPROBE_SWBP_INSN_SIZE);
> > + UPROBE_SWBP_INSN_SIZE, page_mask);
> > /* When unregistering, we may only zap a PTE if uffd is disabled ... */
> > if (is_register || userfaultfd_missing(vma))
> > @@ -443,13 +448,18 @@ static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr,
> > if (!identical || folio_maybe_dma_pinned(folio))
> > goto remap;
> > - /* Zap it and try to reclaim swap space. */
> > - dec_mm_counter(mm, MM_ANONPAGES);
> > - folio_remove_rmap_pte(folio, page, vma);
> > - if (!folio_mapped(folio) && folio_test_swapcache(folio) &&
> > - folio_trylock(folio)) {
> > - folio_free_swap(folio);
> > - folio_unlock(folio);
> > + if (folio_test_hugetlb(folio)) {
> > + hugetlb_remove_rmap(folio);
> > + large = false;
> > + } else {
> > + /* Zap it and try to reclaim swap space. */
> > + dec_mm_counter(mm, MM_ANONPAGES);
> > + folio_remove_rmap_pte(folio, page, vma);
> > + if (!folio_mapped(folio) && folio_test_swapcache(folio) &&
> > + folio_trylock(folio)) {
> > + folio_free_swap(folio);
> > + folio_unlock(folio);
> > + }
> > }
> > folio_put(folio);
> > @@ -461,11 +471,29 @@ static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr,
> > */
> > smp_wmb();
> > /* We modified the page. Make sure to mark the PTE dirty. */
> > - set_pte_at(mm, vaddr, ptep, pte_mkdirty(pte));
> > + if (folio_test_hugetlb(folio))
> > + set_huge_pte_at(mm, vaddr, ptep, huge_pte_mkdirty(pte),
> > + (~page_mask) + 1);
> > + else
> > + set_pte_at(mm, vaddr, ptep, pte_mkdirty(pte));
> > return UWO_DONE;
> > }
> > +static int __write_opcode_hugetlb(pte_t *ptep, unsigned long hmask,
> > + unsigned long vaddr,
> > + unsigned long next, struct mm_walk *walk)
> > +{
> > + return __write_opcode(ptep, vaddr, hmask, walk);
> > +}
> > +
> > +static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr,
> > + unsigned long next, struct mm_walk *walk)
> > +{
> > + return __write_opcode(ptep, vaddr, PAGE_MASK, walk);
> > +}
> > +
> > static const struct mm_walk_ops write_opcode_ops = {
> > + .hugetlb_entry = __write_opcode_hugetlb,
> > .pte_entry = __write_opcode_pte,
> > .walk_lock = PGWALK_WRLOCK,
> > };
> > @@ -492,7 +520,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > unsigned long opcode_vaddr, uprobe_opcode_t opcode)
> > {
> > struct uprobe *uprobe = container_of(auprobe, struct uprobe, arch);
> > - const unsigned long vaddr = opcode_vaddr & PAGE_MASK;
> > + unsigned long vaddr = opcode_vaddr & PAGE_MASK;
> > const bool is_register = !!is_swbp_insn(&opcode);
> > struct uwo_data data = {
> > .opcode = opcode,
> > @@ -503,6 +531,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > struct mmu_notifier_range range;
> > int ret, ref_ctr_updated = 0;
> > struct page *page;
> > + unsigned long page_size = PAGE_SIZE;
> > if (WARN_ON_ONCE(!is_cow_mapping(vma->vm_flags)))
> > return -EINVAL;
> > @@ -521,7 +550,14 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
> > if (ret != 1)
> > goto out;
> > - ret = verify_opcode(page, opcode_vaddr, &opcode);
> > +
> > + if (is_vm_hugetlb_page(vma)) {
> > + struct hstate *h = hstate_vma(vma);
> > + page_size = huge_page_size(h);
> > + vaddr &= huge_page_mask(h);
> > + page = compound_head(page);
>
> I think we should only adjust the range we pass to the mmu notifier and for
> walking the VMA range. But we should not adjust vaddr.
>
> Further, we should not adjust the page if possible ... ideally, we'll treat
> hugetlb folios just like large folios here and operate on subpages.
>
> Inside __write_opcode(), we can derive the page of interest from
> data->opcode_vaddr.

Here you mean __write_opcode_hugetlb(), right? Since we're going with
the two independent variants. Just want to be 100% sure I am following.
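
If so, I think the subpage can be derived along these lines (sketch, with
hmask being the mask the hugetlb walker passes in):

	/* The hugetlb folio is mapped aligned to hmask, so the PAGE_SIZE
	 * subpage containing the opcode is: */
	page = folio_page(folio, (data->opcode_vaddr & ~hmask) >> PAGE_SHIFT);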

> find_get_page() might need some thought, if it won't return a subpage of a
> hugetlb folio. Should be solvable by a wrapper, though.

We can zero out the sub-huge-page bits with the huge page mask in
vaddr_to_offset() in the hugetlb variant, like I do in __copy_insn(), and
that should work, no? Or would you prefer a wrapper?
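
If we do go with a wrapper, I was thinking of something along these lines
(sketch; the helper name is just for illustration, and it mirrors the
masking __copy_insn() does in my patch):

	static loff_t hugetlb_vaddr_to_offset(struct vm_area_struct *vma,
					      unsigned long vaddr)
	{
		/* Clear the sub-huge-page bits so find_get_page() is asked
		 * for the index of the hugetlb folio. */
		return vaddr_to_offset(vma,
				       vaddr & huge_page_mask(hstate_vma(vma)));
	}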

Guillaume.

--
Guillaume Morin <guillaume@xxxxxxxxxxx>