Re: [PATCH v11 08/12] x86/mm: use broadcast TLB flushing for page reclaim TLB flushing

From: Dave Hansen
Date: Fri Feb 14 2025 - 13:52:03 EST


On 2/13/25 08:13, Rik van Riel wrote:
> In the page reclaim code, we only track the CPU(s) where the TLB needs
> to be flushed, rather than all the individual mappings that may be getting
> invalidated.
>
> Use broadcast TLB flushing when that is available.

The changelog here is a little light. This patch is doing a *ton* of stuff.

The existing code has two cases where it is doing a full TLB flush,
not a ranged flush:

1. An actual IPI to some CPUs in batch->cpumask
2. A local flush, no IPI

The change here eliminates both of those options, even the "common
case" of not sending an IPI at all. So it replaces a CPU-local (aka. 1
logical CPU) TLB flush with a broadcast to the *ENTIRE* system. That's
a really, really big change to leave unnoted in the changelog, and
it's not an obvious win to me.
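
Just to spell out case 2 (sketching from memory of tlb.c, so take the
exact details with a grain of salt), the no-IPI path today is
basically:

	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
		/* Only the local CPU is in the mask: flush locally, no IPI */
		lockdep_assert_irqs_enabled();
		local_irq_disable();
		flush_tlb_func(info);
		local_irq_enable();
	}

With this patch, an INVLPGB system never takes that branch; it
broadcasts to everyone instead.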

> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 3c29ef25dce4..de3f6e4ed16d 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1316,7 +1316,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> * a local TLB flush is needed. Optimize this use-case by calling
> * flush_tlb_func_local() directly in this case.
> */
> - if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
> + if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> + invlpgb_flush_all_nonglobals();
> + } else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
> flush_tlb_multi(&batch->cpumask, info);
> } else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
> lockdep_assert_irqs_enabled();

The structure of the code is also a bit off to me. I'd kinda prefer
that we stick to the pattern of (logically):

if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
	invlpgb_...();
} else {
	on_each_cpu*();
}

This patch instead puts the check a couple of functions up in the
call chain, above the on_each_cpu()'s.

It would be more consistent with the previous modifications in this
series if the X86_FEATURE_INVLPGB check was in native_flush_tlb_multi(),
instead.
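
Roughly something like this -- untested, and I'm glossing over which
invlpgb_*() helper is actually the right one for this path and where
it should sit relative to the tracepoints:

	STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
						const struct flush_tlb_info *info)
	{
		if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
			/* Broadcast flush; no IPIs to the CPUs in 'cpumask' */
			invlpgb_flush_all_nonglobals();
			return;
		}

		/* ... existing on_each_cpu_cond_mask() IPI path ... */
	}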

Would that make sense here? It would also preserve the "common case"
optimization that's in arch_tlbbatch_flush().