Re: [tip:x86/mm] [x86/mm/tlb] 209954cbc7: will-it-scale.per_thread_ops 13.2% regression

From: Mathieu Desnoyers
Date: Mon Dec 02 2024 - 11:38:55 EST


On 2024-11-28 21:52, Rik van Riel wrote:
> On Thu, 2024-11-28 at 14:46 -0500, Mathieu Desnoyers wrote:
>>
>> I suspect you could use a similar per-cpu data structure per-mm
>> to keep track of the pending TLB flush mask, and update it simply
>> with load/store to per-CPU data rather than have to cacheline-bounce
>> all over the place due to frequent mm_cpumask atomic updates.
>>
>> Then you get all the benefits without introducing a window where
>> useless TLB flush IPIs get triggered.
>>
>> Of course it's slightly less compact in terms of memory footprint
>> than a cpumask, but you gain a lot by removing cache line bouncing
>> on this frequent context switch code path.
>>
>> Thoughts ?
>
> The first thought that comes to mind is that we already
> have a per-CPU variable indicating which is the currently
> loaded mm on that CPU.

Only on x86 though.


> We could probably just skip sending IPIs to CPUs that do
> not have the mm_struct currently loaded.
>
> This can race against switch_mm_irqs_off() on a CPU
> switching to that mm simultaneously with the TLB flush,
> which should be fine because that CPU cannot load TLB
> entries from previously cleared page tables.
>
> However, it does mean we cannot safely clear bits
> out of the mm_cpumask, because a race between clearing
> the bit on one CPU, and setting it on another would not
> be something we could easily catch at all, unless we
> can figure out some clever memory ordering thing there.
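
If I read the first part right, the check before sending the IPI would
look something like the sketch below (untested, and x86-only since it
relies on the per-CPU loaded_mm that arch/x86 already keeps in
cpu_tlbstate; the helper name is mine):

#include <linux/mm_types.h>     /* struct mm_struct */
#include <asm/tlbflush.h>       /* per-CPU cpu_tlbstate */

/*
 * Sketch: a CPU listed in mm_cpumask(mm) only needs the TLB flush IPI
 * if it currently has @mm loaded. Plain racy read: if the remote CPU
 * is concurrently switching to @mm, it will populate its TLB from page
 * tables that have already been updated, so skipping it here is still
 * correct, as you describe above.
 */
static inline bool cpu_needs_flush_ipi(int cpu, struct mm_struct *mm)
{
        return READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == mm;
}

That would avoid the IPIs to CPUs which no longer have the mm loaded,
but indeed it leaves open your point about when bits can ever be
cleared from mm_cpumask.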


Or we just build a per-CPU mm_cpumask from per-CPU state every time we
want to use the mm_cpumask (see the sketch at the end of this mail).
But AFAIU this is going to be a tradeoff between:

- Overhead of context switch at scale

(e.g. will-it-scale:)
for a in $(seq 1 2); do (./context_switch1_threads -t 192 -s 20 &); done

For reference, my POC reaches 50% performance improvement with this.

vs

- Overhead of TLB flush

(e.g. will-it-scale:)
./tlb_flush2_threads -t 192 -s 20

For reference, my POC has about 33% regression on that test case due
to extra work when using mm_cpumask.

So I guess what we end up doing really depends on which scenario we
consider most frequent.
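
FWIW, the overall shape of that "rebuild from per-CPU state" lookup is
roughly the following (just a sketch to illustrate the idea, not the
actual POC code; it reuses x86's cpu_tlbstate.loaded_mm and the function
name is made up):

#include <linux/cpumask.h>
#include <linux/mm_types.h>
#include <asm/tlbflush.h>       /* per-CPU cpu_tlbstate */

/*
 * Sketch: rather than maintaining mm_cpumask() with atomic updates on
 * every context switch, rebuild the set of CPUs to flush from per-CPU
 * state at TLB flush time.
 */
static void mm_build_flush_mask(struct mm_struct *mm, struct cpumask *mask)
{
        int cpu;

        cpumask_clear(mask);
        for_each_possible_cpu(cpu) {
                /* Racy read; same correctness argument as above. */
                if (READ_ONCE(per_cpu(cpu_tlbstate.loaded_mm, cpu)) == mm)
                        cpumask_set_cpu(cpu, mask);
        }
}

The for_each_possible_cpu() walk is where the extra work on the
tlb_flush2_threads side comes from, while the context switch path no
longer touches a shared cache line, which is what the
context_switch1_threads case gains from.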

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com