Re: [PATCH v2 0/3] skip redundant TLB sync IPIs
From: Lance Yang
Date: Sat Jan 03 2026 - 03:39:33 EST
On 2026/1/3 00:41, Dave Hansen wrote:
On 12/31/25 04:33, David Hildenbrand (Red Hat) wrote:
On 12/31/25 05:26, Dave Hansen wrote:
On 12/29/25 06:52, Lance Yang wrote:
...
This series introduces a way for architectures to indicate their TLB flush
already provides full synchronization, allowing the redundant IPI to be
skipped. For now, the optimization is implemented for x86 first and
applied to all page table operations that free or unshare tables.
I really don't like all the complexity here. Even on x86, there are
three or more ways of deriving this. Having the pv_ops check the value
of another pv op is also a bit unsettling.
Right. What I actually meant is that we simply have a property "bool
flush_tlb_multi_implies_ipi_broadcast" that we set only to true from the
initialization code.
Without comparing the pv_ops.
That should reduce the complexity quite a bit IMHO.
Yeah, that sounds promising.
Thanks a lot for taking the time to review!
Yeah, I simplified things to just a bool property set during init
(no pv_ops comparison at runtime) as follows:
---8<---
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 13f9cd31c8f8..a926d459e6f5 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -698,6 +698,7 @@ static __always_inline unsigned long arch_local_irq_save(void)
extern void default_banner(void);
void native_pv_lock_init(void) __init;
+void setup_pv_tlb_flush_ipi_broadcast(void) __init;
#else /* __ASSEMBLER__ */
@@ -727,6 +728,10 @@ void native_pv_lock_init(void) __init;
static inline void native_pv_lock_init(void)
{
}
+
+static inline void setup_pv_tlb_flush_ipi_broadcast(void)
+{
+}
#endif
#endif /* !CONFIG_PARAVIRT */
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 3502939415ad..7c010d8bee60 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -133,6 +133,12 @@ struct pv_mmu_ops {
void (*flush_tlb_multi)(const struct cpumask *cpus,
const struct flush_tlb_info *info);
+ /*
+ * Indicates whether flush_tlb_multi IPIs provide sufficient
+ * synchronization for GUP-fast when freeing or unsharing page tables.
+ */
+ bool flush_tlb_multi_implies_ipi_broadcast;
+
/* Hook for intercepting the destruction of an mm_struct. */
void (*exit_mmap)(struct mm_struct *mm);
void (*notify_page_enc_status_changed)(unsigned long pfn, int npages,
bool enc);
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..f570c7b2d03e 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -5,10 +5,23 @@
#define tlb_flush tlb_flush
static inline void tlb_flush(struct mmu_gather *tlb);
+#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void);
+
#include <asm-generic/tlb.h>
#include <linux/kernel.h>
#include <vdso/bits.h>
#include <vdso/page.h>
+#include <asm/paravirt.h>
+
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+#ifdef CONFIG_PARAVIRT
+ return pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast;
+#else
+ return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
+#endif
+}
static inline void tlb_flush(struct mmu_gather *tlb)
{
@@ -20,7 +33,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
end = tlb->end;
}
- flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+ flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
+ tlb->freed_tables || tlb->unshared_tables);
}
static inline void invlpg(unsigned long addr)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ab3e172dcc69..0a49c2d79693 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -60,6 +60,23 @@ void __init native_pv_lock_init(void)
static_branch_enable(&virt_spin_lock_key);
}
+void __init setup_pv_tlb_flush_ipi_broadcast(void)
+{
+ /*
+ * For native TLB flush, if we don't have INVLPGB, we use IPI-based
+ * flushing which sends real IPIs to all CPUs. This provides sufficient
+ * synchronization for GUP-fast.
+ *
+ * For paravirt (e.g., KVM, Xen, HyperV), hypercalls may not send real
+ * IPIs, so we keep the default value of false. Only set to true when
+ * using native flush_tlb_multi without INVLPGB.
+ */
+ if (pv_ops.mmu.flush_tlb_multi == native_flush_tlb_multi &&
+ !cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast = true;
+}
+
+
struct static_key paravirt_steal_enabled;
struct static_key paravirt_steal_rq_enabled;
@@ -173,6 +190,7 @@ struct paravirt_patch_template pv_ops = {
.mmu.flush_tlb_kernel = native_flush_tlb_global,
.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
.mmu.flush_tlb_multi = native_flush_tlb_multi,
+ .mmu.flush_tlb_multi_implies_ipi_broadcast = false,
.mmu.exit_mmap = paravirt_nop,
.mmu.notify_page_enc_status_changed = paravirt_nop,
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 74aa904be6dc..3f673e686b12 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -51,6 +51,7 @@
#include <asm/olpc_ofw.h>
#include <asm/pci-direct.h>
#include <asm/prom.h>
+#include <asm/paravirt.h>
#include <asm/proto.h>
#include <asm/realmode.h>
#include <asm/thermal.h>
@@ -1257,6 +1258,7 @@ void __init setup_arch(char **cmdline_p)
io_apic_init_mappings();
x86_init.hyper.guest_late_init();
+ setup_pv_tlb_flush_ipi_broadcast();
e820__reserve_resources();
e820__register_nosave_regions(max_pfn);
---
But maybe you have an even better way to indicate support, in a
very simple way.
Rather than having some kind of explicit support enumeration, the other
idea I had would be to actually track the state about what needs to get
flushed somewhere. For instance, even CPUs with INVLPGB support enabled
still use IPIs sometimes. That makes the
tlb_table_flush_implies_ipi_broadcast() check a bit imperfect as is,
because it will call for the extra sync IPI even when INVLPGB isn't
being used for an mm.
First, we already save some semblance of support for doing different
flushes when freeing page tables in mmu_gather->freed_tables. But, the call
sites in question here are for a single flush and don't use mmu_gathers.
The other pretty straightforward thing to do would be to add something
to mm->context that indicates that page tables need to be freed but
there might still be wild gup walkers out there that need an IPI. It
would get set when the page tables are modified and cleared at all the
sites where IPIs are sent.
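To make sure I understand the idea, here's a rough userspace sketch of how
that per-mm tracking could behave. All names below (mm_sim,
pending_table_sync, tlb_flush_sim, ...) are made up for illustration only;
nothing here is taken from the actual x86 code:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the per-mm state that would live in mm->context. */
struct mm_sim {
	/* Page tables changed; GUP-fast walkers may still need an IPI. */
	bool pending_table_sync;
};

/* Set when page tables are freed or unshared for this mm. */
static void mark_tables_changed(struct mm_sim *mm)
{
	mm->pending_table_sync = true;
}

/*
 * Cleared at the sites that send real IPIs: a broadcast IPI also
 * synchronizes against concurrent GUP-fast, satisfying the pending sync.
 * An INVLPGB-style flush (no IPIs) leaves the sync pending.
 */
static void tlb_flush_sim(struct mm_sim *mm, bool used_ipi_broadcast)
{
	if (used_ipi_broadcast)
		mm->pending_table_sync = false;
}

/* The extra sync IPI is only needed while a sync is still pending. */
static bool need_sync_ipi(const struct mm_sim *mm)
{
	return mm->pending_table_sync;
}
```

This would handle the mixed case precisely: an mm that was last flushed via
IPIs skips the extra sync, while an mm flushed via INVLPGB still gets it.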
Thanks for the suggestion! The mm->context tracking idea makes a lot
of sense - it would handle those mixed INVLPGB/IPI cases much better :)
Maybe we could do that as a follow-up. I'd like to keep things simple
for now, so we just add a bool property to skip redundant TLB sync IPIs
on systems without INVLPGB support.
Then we could add the mm->context (or something similar) tracking later
to handle things more precisely.
Anyway, I'm open to going straight to the mm->context approach as well
and happy to do that instead :D
Thanks,
Lance
That said, complexity can be worth it with sufficient demonstrated
gains. But:
When unsharing hugetlb PMD page tables or collapsing pages in khugepaged,
we send two IPIs: one for TLB invalidation, and another to synchronize
with concurrent GUP-fast walkers.
Those aren't exactly hot paths. khugepaged is fundamentally rate
limited. I don't think unsharing hugetlb PMD page tables is all that
common either.
Given that the added IPIs during unsharing broke Oracle DBs rather badly
[1], I think this is actually a case worth optimizing.
...
[1] https://lkml.kernel.org/r/20251223214037.580860-1-david@xxxxxxxxxx
Gah, that's good context, thanks.
Are there any tests out there that might catch this case better?
It might be something good to have 0day watch for.