Re: [PATCH] x86,tlb: update mm_cpumask lazily

From: Dave Hansen
Date: Fri Nov 08 2024 - 15:32:09 EST


On 11/8/24 11:31, Rik van Riel wrote:
> On busy multi-threaded workloads, there can be significant contention
> on the mm_cpumask at context switch time.
>
> Reduce that contention by updating mm_cpumask lazily, setting the CPU bit
> at context switch time (if not already set), and clearing the CPU bit at
> the first TLB flush sent to a CPU where the process isn't running.
>
> When a flurry of TLB flushes for a process happen, only the first one
> will be sent to CPUs where the process isn't running. The others will
> be sent to CPUs where the process is currently running.

So I guess it comes down to balancing:

The cpumask_clear_cpu() happens on every mm switch, which can be
thousands of times a second. But it's _relatively_ cheap: dozens to a
couple hundred cycles.

with:

Skipping the cpumask_clear_cpu() will cause more TLB flushes. It can
cause at most one extra TLB flush for each time a process is migrated
off a CPU and never returns. This is _relatively_ expensive: on the
order of thousands of cycles to send and receive an IPI.

Migrations are obviously the enemy here, but they're the enemy for lots
of _other_ reasons too, which is a really nice property.

The only thing I can think of that really worries me is some kind of
forked worker model where before this patch you would have:

* fork()
* run on CPU A
* ... migrate to CPU B
* malloc()/free(), needs to flush B only
* exit()

and after:

* fork()
* run on CPU A
* ... migrate to CPU B
* malloc()/free(), needs to flush A+B, including IPI
* exit()

Where that IPI wasn't needed at *all* before. But that's totally contrived.

So I think this is the kind of thing we'd want to apply to -rc1 and let
the robots poke at it for a few weeks. But it does seem like a sound
idea to me.