Re: [PATCH RESEND v3 2/2] mm: introduce pmdp_collapse_flush_sync() to skip redundant IPI

From: Lance Yang

Date: Tue Jan 06 2026 - 10:42:04 EST




On 2026/1/6 23:07, David Hildenbrand (Red Hat) wrote:
On 1/6/26 13:03, Lance Yang wrote:
From: Lance Yang <lance.yang@xxxxxxxxx>

pmdp_collapse_flush() may already send IPIs to flush TLBs, and then
callers send another IPI via tlb_remove_table_sync_one() or
pmdp_get_lockless_sync() to synchronize with concurrent GUP-fast walkers.

However, since GUP-fast runs with IRQs disabled, the TLB flush IPI already
provides the necessary synchronization. We can avoid the redundant second
IPI.
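
For reference, the GUP-fast side looks roughly like this (simplified
sketch only, not the actual mm/gup.c code):

    unsigned long flags;

    /*
     * The lockless walk runs with IRQs off, so this CPU cannot handle
     * the TLB flush IPI while it is inside the walk.  Once every CPU
     * has acknowledged that IPI, no GUP-fast walker can still be
     * relying on the old PMD.
     */
    local_irq_save(flags);
    /* ... lockless page table walk, re-checking the PMD after pinning ... */
    local_irq_restore(flags);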

Introduce pmdp_collapse_flush_sync() which combines flush and sync:

- For architectures using the generic pmdp_collapse_flush() implementation
   (e.g., x86): Use mmu_gather to track whether the TLB flush sent an IPI.
   If it did, tlb_gather_remove_table_sync_one() will skip the redundant one.

- For architectures with a custom pmdp_collapse_flush() (s390, riscv,
   powerpc): Fall back to calling pmdp_collapse_flush() followed by
   tlb_remove_table_sync_one(), as sketched below. No behavior change.
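
   In other words, the fallback path boils down to (illustrative sketch,
   not the literal hunk):

       pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma,
                                      unsigned long address, pmd_t *pmdp)
       {
           /* Arch-specific flush, which may or may not use IPIs ... */
           pmd_t pmd = pmdp_collapse_flush(vma, address, pmdp);

           /* ... followed by the same unconditional sync IPI as before. */
           tlb_remove_table_sync_one();
           return pmd;
       }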

Update khugepaged to use pmdp_collapse_flush_sync() instead of separate
flush and sync calls. Remove the now-unused pmdp_get_lockless_sync() macro.

Suggested-by: David Hildenbrand (Red Hat) <david@xxxxxxxxxx>
Signed-off-by: Lance Yang <lance.yang@xxxxxxxxx>
---
  include/linux/pgtable.h | 13 +++++++++----
  mm/khugepaged.c         |  9 +++------
  mm/pgtable-generic.c    | 34 ++++++++++++++++++++++++++++++++++
  3 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index eb8aacba3698..69e290dab450 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -755,7 +755,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
      return pmd;
  }
  #define pmdp_get_lockless pmdp_get_lockless
-#define pmdp_get_lockless_sync() tlb_remove_table_sync_one()
  #endif /* CONFIG_PGTABLE_LEVELS > 2 */
  #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
@@ -774,9 +773,6 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
  {
      return pmdp_get(pmdp);
  }
-static inline void pmdp_get_lockless_sync(void)
-{
-}
  #endif
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -1174,6 +1170,8 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm,
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
  extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
                   unsigned long address, pmd_t *pmdp);
+extern pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma,
+                 unsigned long address, pmd_t *pmdp);
  #else
  static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
                      unsigned long address,
@@ -1182,6 +1180,13 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
      BUILD_BUG();
      return *pmdp;
  }
+static inline pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma,
+                    unsigned long address,
+                    pmd_t *pmdp)
+{
+    BUILD_BUG();
+    return *pmdp;
+}
  #define pmdp_collapse_flush pmdp_collapse_flush
  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
  #endif
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9f790ec34400..0a98afc85c50 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1177,10 +1177,9 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
       * Parallel GUP-fast is fine since GUP-fast will back off when
       * it detects PMD is changed.
       */
-    _pmd = pmdp_collapse_flush(vma, address, pmd);
+    _pmd = pmdp_collapse_flush_sync(vma, address, pmd);
      spin_unlock(pmd_ptl);
      mmu_notifier_invalidate_range_end(&range);
-    tlb_remove_table_sync_one();

Now you issue the IPI under PTL.

We already send the TLB flush IPI under the PTL before this patch, e.g. in
try_collapse_pte_mapped_thp():

pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
pmdp_get_lockless_sync();
pte_unmap_unlock(start_pte, ptl);

But anyway, we can do better by passing ptl in and unlocking
before the sync IPI ;)

[...]

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d3aec7a9926a..be2ee82e6fc4 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -233,6 +233,40 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
      flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
      return pmd;
  }
+
+pmd_t pmdp_collapse_flush_sync(struct vm_area_struct *vma, unsigned long address,
+                   pmd_t *pmdp)
+{
+    struct mmu_gather tlb;
+    pmd_t pmd;
+
+    VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+    VM_BUG_ON(pmd_trans_huge(*pmdp));
+
+    tlb_gather_mmu(&tlb, vma->vm_mm);

Should we be using the new tlb_gather_mmu_vma(), and do we have to set the TLB pagesize to PMD?

Yes, good point on tlb_gather_mmu_vma()!

So, the sequence will be:

tlb_gather_mmu_vma(&tlb, vma);
pmd = pmdp_huge_get_and_clear(...);
flush_tlb_mm_range(..., &tlb);
if (ptl)
    spin_unlock(ptl);
tlb_gather_remove_table_sync_one(&tlb);
tlb_finish_mmu(&tlb);

Thanks,
Lance