On Thu, 2024-11-28 at 14:46 -0500, Mathieu Desnoyers wrote:
I suspect you could use a similar per-cpu data structure per-mm
to keep track of the pending TLB flush mask, and update it simply
with load/store to per-CPU data rather than have to cacheline-bounce
all over the place due to frequent mm_cpumask atomic updates.
Then you get all the benefits without introducing a window where
useless TLB flush IPIs get triggered.

Of course it's slightly less compact in terms of memory footprint
than a cpumask, but you gain a lot by removing cache line bouncing
on this frequent context switch code path.
Thoughts ?
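
If I am reading the suggestion right, it would look roughly like
the sketch below. All of the names here (mm_tlb_cpu, mm->tlb_cpu,
note_mm_loaded, flush_mm_everywhere) are made up for illustration,
and clearing the flag again (from the owning CPU when it switches
away, or after a completed flush) is left out:

struct mm_tlb_cpu {
	unsigned int needs_flush;	/* written only by the owning CPU */
} ____cacheline_aligned_in_smp;

/* in struct mm_struct: struct mm_tlb_cpu __percpu *tlb_cpu; */

/* context switch path: plain per-CPU store, no atomic RMW */
static inline void note_mm_loaded(struct mm_struct *mm)
{
	this_cpu_write(mm->tlb_cpu->needs_flush, 1);
}

/* flush path: walk the per-CPU entries instead of mm_cpumask */
static void flush_mm_everywhere(struct mm_struct *mm)
{
	int cpu;

	for_each_online_cpu(cpu) {
		if (!READ_ONCE(per_cpu_ptr(mm->tlb_cpu, cpu)->needs_flush))
			continue;
		/* queue/send a TLB flush IPI to @cpu here */
	}
}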
The first thought that comes to mind is that we already
have a per-CPU variable indicating which mm is currently
loaded on each CPU.
We could probably just skip sending IPIs to CPUs that do
not have the mm_struct currently loaded.
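
Something along these lines, purely as a sketch; the helper name
and the way it would be wired into the flush path are made up, but
the per-CPU tracking itself already exists (cpu_tlbstate.loaded_mm
on x86):

static void build_flush_mask(struct mm_struct *mm,
			     const struct cpumask *candidates,
			     struct cpumask *to_flush)
{
	int cpu;

	cpumask_clear(to_flush);
	for_each_cpu(cpu, candidates) {
		/*
		 * Plain read; a CPU racing with us in
		 * switch_mm_irqs_off() is discussed below.
		 */
		if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == mm)
			cpumask_set_cpu(cpu, to_flush);
	}
	/* send the flush IPIs to the CPUs left in @to_flush */
}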
This can race against switch_mm_irqs_off() on a CPU
switching to that mm simultaneously with the TLB flush.
That should be fine, because the page table entries have
already been cleared by the time the flush is sent, so
the newly switching CPU cannot load stale TLB entries
from them.
However, it does mean we cannot safely clear bits
out of the mm_cpumask: a race between clearing the bit
on one CPU and setting it on another (as it switches to
the mm) could leave the bit cleared for a CPU that is
actually using the mm, and we would have no easy way to
catch that, unless we can come up with some clever
memory ordering scheme there.
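
To illustrate the window (the function name is made up, and this
is exactly the kind of lazy trimming I don't think we can do):

static void try_trim_mm_cpumask(struct mm_struct *mm, int cpu)
{
	/* Flushing CPU: @cpu does not appear to be running @mm ... */
	if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) != mm)
		/*
		 * ... but between the check above and the clear
		 * below, @cpu can run switch_mm_irqs_off(), set its
		 * bit in mm_cpumask(mm) and load @mm.  The clear then
		 * removes the bit for a CPU that is really using the
		 * mm, and later flushes that rely on mm_cpumask will
		 * silently skip it.
		 */
		cpumask_clear_cpu(cpu, mm_cpumask(mm));
}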