Re: [PATCH RESEND v3 1/2] mm/tlb: skip redundant IPI when TLB flush already synchronized

From: Dave Hansen

Date: Tue Jan 06 2026 - 11:25:00 EST


On 1/6/26 04:03, Lance Yang wrote:
> From: Lance Yang <lance.yang@xxxxxxxxx>
>
> When unsharing hugetlb PMD page tables, we currently send two IPIs: one
> for TLB invalidation, and another to synchronize with concurrent GUP-fast
> walkers via tlb_remove_table_sync_one().
>
> However, if the TLB flush already sent IPIs to all CPUs (when freed_tables
> or unshared_tables is true), the second IPI is redundant. GUP-fast runs
> with IRQs disabled, so when the TLB flush IPI completes, any concurrent
> GUP-fast must have finished.
>
> To avoid the redundant IPI, we add a flag to mmu_gather to track whether
> the TLB flush sent IPIs. We pass the mmu_gather pointer through the TLB
> flush path via flush_tlb_info, so native_flush_tlb_multi() can set the
> flag when it sends IPIs for freed_tables. We also set the flag for
> local-only flushes, since disabling IRQs provides the same guarantee.

The lack of imperative voice is killing me. :)

> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> index 866ea78ba156..c5950a92058c 100644
> --- a/arch/x86/include/asm/tlb.h
> +++ b/arch/x86/include/asm/tlb.h
> @@ -20,7 +20,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
>  		end = tlb->end;
>  	}
>
> -	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
> +	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
> +			   tlb->freed_tables || tlb->unshared_tables, tlb);
>  }

I think this hunk sums up v3 pretty well. Where there was a single boolean, now there are two. To add to that, the structure that contains the booleans is itself being passed in. The flush_tlb_mm_range() parameter is still named 'freed_tables', but the value passed for it is going from:

tlb->freed_tables

which is pretty obviously correct to:

tlb->freed_tables || tlb->unshared_tables

which is _far_ from obviously correct.

I'm at a loss for why the patch wouldn't just do this:

-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb);
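
A rough user-space sketch of that shape, for illustration only. Nothing here is kernel code: 'struct gather_sketch', tlb_wants_table_sync() and flush_range_sketch() are hypothetical names, and the 'freed_tables || unshared_tables' condition is a placeholder whose correctness is exactly what still needs to be argued. The sketch only shows the callee taking the gather structure, with one helper owning the decision so that call sites do not open-code it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two mmu_gather bits discussed above. */
struct gather_sketch {
	bool freed_tables;
	bool unshared_tables;
};

/*
 * Single place that decides whether the flush must also synchronize
 * with lockless page-table walkers. The condition is a placeholder.
 */
static bool tlb_wants_table_sync(const struct gather_sketch *tlb)
{
	assert(tlb != NULL);
	return tlb->freed_tables || tlb->unshared_tables;
}

/*
 * Models a flush entry point that takes the gather structure instead
 * of a pre-computed boolean. It returns the decision so that it can be
 * checked; a real flush would act on it rather than return it.
 */
static bool flush_range_sketch(const struct gather_sketch *tlb)
{
	return tlb_wants_table_sync(tlb);
}
```

With this shape, the caller passes only the gather structure, and any later change to the condition happens in one helper instead of at every call site.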

I suspect these were sent out in a bit of haste, which isn't the first time I've gotten that feeling with this series.

Could we slow down, please?

> static inline void invlpg(unsigned long addr)
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 00daedfefc1b..83c260c88b80 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -220,6 +220,7 @@ struct flush_tlb_info {
> * will be zero.
> */
> struct mm_struct *mm;
> + struct mmu_gather *tlb;
> unsigned long start;
> unsigned long end;
> u64 new_tlb_gen;

This also gives me pause.

There is a *lot* of redundant information between 'struct mmu_gather' and 'struct flush_tlb_info'. At a minimum, there needs to be a description of how the two relate to each other. I would have naively thought that the right move here would be to pull the mmu_gather data out at one discrete time rather than store a pointer to it.
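
As a minimal sketch of the "pull the data out at one discrete time" alternative, assuming heavily simplified models of both structures: 'struct gather_model', 'struct flush_info_model', the 'wait_for_table_walkers' field and flush_info_from_gather() are all hypothetical names, and the real structures carry many more fields. The flush descriptor gets its own copy of what it needs, so nothing downstream dereferences the gather structure:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of the gather side. */
struct gather_model {
	unsigned long start;
	unsigned long end;
	bool freed_tables;
	bool unshared_tables;
};

/* Hypothetical, simplified model of the flush descriptor. */
struct flush_info_model {
	unsigned long start;
	unsigned long end;
	bool wait_for_table_walkers;	/* derived once at snapshot time */
};

/*
 * Copy everything the flush path needs in one place. The descriptor
 * holds no pointer back to the gather structure afterwards.
 */
static struct flush_info_model
flush_info_from_gather(const struct gather_model *tlb)
{
	assert(tlb != NULL);

	struct flush_info_model info = {
		.start = tlb->start,
		.end = tlb->end,
		/* Placeholder condition; not claimed to be the correct one. */
		.wait_for_table_walkers =
			tlb->freed_tables || tlb->unshared_tables,
	};
	return info;
}
```

The trade-off is that a snapshot cannot report anything back to the gather structure, which the posted patch does by setting a flag through the stored pointer. Any such result would have to be returned to the caller some other way.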

What I see here is, I suspect, the most expedient way to do it. I'd _certainly_ have done this myself if I were just hacking something together to play with as quickly as possible.

So, in the end, I don't hate the approach here (yet). But it is almost impossible to evaluate it because the series is taking some rather egregious shortcuts and is lacking any real semblance of a refactoring effort.