Re: [PATCH v2 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather

From: Harry Yoo

Date: Fri Dec 19 2025 - 07:38:36 EST


On Fri, Dec 12, 2025 at 08:10:19AM +0100, David Hildenbrand (Red Hat) wrote:
> As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix
> huge_pmd_unshare() vs GUP-fast race") we can end up in some situations
> where we perform so many IPI broadcasts when unsharing hugetlb PMD page
> tables that it severely regresses some workloads.
>
> In particular, when we fork()+exit(), or when we munmap() a large
> area backed by many shared PMD tables, we perform one IPI broadcast per
> unshared PMD table.
>
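Just to make sure I understand the problem: before this series we
effectively performed one ranged flush (and hence one IPI broadcast
on many configurations) per unshared PMD table, along the lines of
the simplified sketch below (paraphrased, not the exact mm/hugetlb.c
code; only the function names are real):

	for (address = start; address < end; address += PUD_SIZE) {
		...
		if (huge_pmd_unshare(mm, vma, address, ptep))
			/* one broadcast per unshared PMD table */
			flush_hugetlb_tlb_range(vma, address & PUD_MASK,
					(address & PUD_MASK) + PUD_SIZE);
		...
	}

With the mmu_gather conversion these collapse into a single batched
flush at the end.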

[...snip...]

> Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
> Reported-by: "Uschakow, Stanislav" <suschako@xxxxxxxxx>
> Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@xxxxxxxxx/
> Tested-by: Laurence Oberman <loberman@xxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>
> Signed-off-by: David Hildenbrand (Red Hat) <david@xxxxxxxxxx>
> ---
> include/asm-generic/tlb.h | 74 ++++++++++++++++++++++-
> include/linux/hugetlb.h | 19 +++---
> mm/hugetlb.c | 121 ++++++++++++++++++++++----------------
> mm/mmu_gather.c | 7 +++
> mm/mprotect.c | 2 +-
> mm/rmap.c | 25 +++++---
> 6 files changed, 179 insertions(+), 69 deletions(-)
>
> @@ -6522,22 +6511,16 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
> pte = huge_pte_clear_uffd_wp(pte);
> huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
> pages++;
> + tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
> }
>
> next:
> spin_unlock(ptl);
> cond_resched();
> }
> - /*
> - * There is nothing protecting a previously-shared page table that we
> - * unshared through huge_pmd_unshare() from getting freed after we
> - * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
> - * succeeded, flush the range corresponding to the pud.
> - */
> - if (shared_pmd)
> - flush_hugetlb_tlb_range(vma, range.start, range.end);
> - else
> - flush_hugetlb_tlb_range(vma, start, end);
> +
> + tlb_flush_mmu_tlbonly(tlb);
> + huge_pmd_unshare_flush(tlb, vma);

Shouldn't we teach mmu_gather that it has to call
flush_hugetlb_tlb_range() instead of the ordinary TLB flush routine?
Otherwise this will break arches that have "special requirements"
for evicting the TLB entries that back hugetlb mappings.
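
To be concrete, something like the below is what I have in mind (a
completely untested sketch; the helper name is made up, and how the
VMA gets passed down is hand-waved):

/*
 * Hypothetical helper: flush the range gathered in the mmu_gather
 * through the hugetlb-aware routine rather than the generic
 * flush_tlb_range(), so that arches overriding
 * flush_hugetlb_tlb_range() (e.g. to pick a TLBI stride matching
 * the hugetlb page size) still get their special handling.
 */
static inline void tlb_flush_hugetlb_tlbonly(struct mmu_gather *tlb,
					     struct vm_area_struct *vma)
{
	if (tlb->start >= tlb->end)
		return;

	flush_hugetlb_tlb_range(vma, tlb->start, tlb->end);
	/* reset the gathered range, as tlb_flush_mmu_tlbonly() does */
	__tlb_reset_range(tlb);
}

Otherwise, on an architecture where flush_hugetlb_tlb_range() is not
simply flush_tlb_range(), the batched flush might invalidate with the
wrong stride (or miss arch-specific handling entirely).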

> /*
> * No need to call mmu_notifier_arch_invalidate_secondary_tlbs() we are
> * downgrading page table protection not changing it to point to a new

--
Cheers,
Harry / Hyeonggon