Re: [PATCH v4 10/12] x86,tlb: do targeted broadcast flushing from tlbbatch code

From: Jann Horn
Date: Mon Jan 13 2025 - 12:49:21 EST


On Mon, Jan 13, 2025 at 6:05 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> On Sun, Jan 12, 2025 at 4:55 PM Rik van Riel <riel@xxxxxxxxxxx> wrote:
> > Instead of doing a system-wide TLB flush from arch_tlbbatch_flush,
> > queue up asynchronous, targeted flushes from arch_tlbbatch_add_pending.
> >
> > This also allows us to avoid adding the CPUs of processes using broadcast
> > flushing to the batch->cpumask, and will hopefully further reduce TLB
> > flushing from the reclaim and compaction paths.
> [...]
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index 80375ef186d5..532911fbb12a 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -1658,9 +1658,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> >  	 * a local TLB flush is needed. Optimize this use-case by calling
> >  	 * flush_tlb_func_local() directly in this case.
> >  	 */
> > -	if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> > -		invlpgb_flush_all_nonglobals();
> > -	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
> > +	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
> >  		flush_tlb_multi(&batch->cpumask, info);
> >  	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
> >  		lockdep_assert_irqs_enabled();
> > @@ -1669,12 +1667,49 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> >  		local_irq_enable();
> >  	}
> >
> > +	/*
> > +	 * If we issued (asynchronous) INVLPGB flushes, wait for them here.
> > +	 * The cpumask above contains only CPUs that were running tasks
> > +	 * not using broadcast TLB flushing.
> > +	 */
> > +	if (cpu_feature_enabled(X86_FEATURE_INVLPGB) && batch->used_invlpgb) {
> > +		tlbsync();
> > +		migrate_enable();
> > +		batch->used_invlpgb = false;
> > +	}
> > +
> >  	cpumask_clear(&batch->cpumask);
> >
> >  	put_flush_tlb_info();
> >  	put_cpu();
> >  }
> >
> > +void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> > +			       struct mm_struct *mm,
> > +			       unsigned long uaddr)
> > +{
> > +	if (static_cpu_has(X86_FEATURE_INVLPGB) && mm_global_asid(mm)) {
> > +		u16 asid = mm_global_asid(mm);
> > +		/*
> > +		 * Queue up an asynchronous invalidation. The corresponding
> > +		 * TLBSYNC is done in arch_tlbbatch_flush(), and must be done
> > +		 * on the same CPU.
> > +		 */
> > +		if (!batch->used_invlpgb) {
> > +			batch->used_invlpgb = true;
> > +			migrate_disable();
> > +		}
> > +		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
> > +		/* Do any CPUs supporting INVLPGB need PTI? */
> > +		if (static_cpu_has(X86_FEATURE_PTI))
> > +			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
> > +	} else {
> > +		inc_mm_tlb_gen(mm);
> > +		cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> > +	}
> > +	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
> > +}
>
> How does this work if the MM is currently transitioning to a global
> ASID? Should the "mm_global_asid(mm)" check maybe be replaced with
> something that checks if the MM has fully transitioned to a global
> ASID, so that we keep using the classic path if there might be holdout
> CPUs?

Ah, but if we did that, we'd also have to make sure that the MM-switching
path keeps invalidating the TLB when the MM's TLB generation count
increments, even on CPUs that have already switched over to the global
ASID.
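
Concretely, something like the sketch below is what I was imagining.
It's only a rough sketch on top of this patch; mm_global_asid_transition_done()
is a made-up name for "every CPU running this mm has picked up the
global ASID" and doesn't exist anywhere in the series:

void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
			       struct mm_struct *mm,
			       unsigned long uaddr)
{
	/*
	 * Only use the broadcast path once *every* CPU running this mm
	 * is known to be using the global ASID; until then, holdout
	 * CPUs on per-CPU ASIDs would not be covered by the INVLPGB.
	 */
	if (static_cpu_has(X86_FEATURE_INVLPGB) &&
	    mm_global_asid(mm) &&
	    mm_global_asid_transition_done(mm)) {	/* hypothetical helper */
		u16 asid = mm_global_asid(mm);

		if (!batch->used_invlpgb) {
			batch->used_invlpgb = true;
			migrate_disable();
		}
		invlpgb_flush_user_nr_nosync(kern_pcid(asid), uaddr, 1, false);
		if (static_cpu_has(X86_FEATURE_PTI))
			invlpgb_flush_user_nr_nosync(user_pcid(asid), uaddr, 1, false);
	} else {
		/*
		 * Classic path: CPUs still on a per-CPU ASID pick up the
		 * flush via tlb_gen / batch->cpumask as before.
		 */
		inc_mm_tlb_gen(mm);
		cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
	}
	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
}

And then, as said above, the context-switch path would still have to
honor the mm's tlb_gen even on CPUs that are already running with the
global ASID, because during a partial transition the flush for this mm
goes down the classic path.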