Re: [linus:master] [x86/mm/tlb] 7e33001b8b: will-it-scale.per_thread_ops 20.7% improvement

From: Rik van Riel
Date: Sat Nov 30 2024 - 14:57:11 EST


On Sat, 2024-11-30 at 09:54 -0800, Linus Torvalds wrote:
> On Sat, 30 Nov 2024 at 09:31, Rik van Riel <riel@xxxxxxxxxxx> wrote:
> >
> > 1) Stop using the mm_cpumask altogether on x86
>
> I think you would still want it as a "this is the upper bound" thing -
> exactly like your lazy code effectively does now.
>
> It's not giving some precise "these are the CPU's that have TLB
> contents", but instead just a "these CPU's *might* have TLB
> contents".
>
> But that's a *big* win for any single-threaded case, to not have to
> walk over potentially hundreds of CPUs when that thing has only ever
> actually been on one or two cores.
>
> Because a lot of short-lived processes only ever live on a single
> CPU.
>
Good point. We do want to keep optimizations for
single-threaded processes in place.

> The benchmarks you are optimizing for - as well as the ones that
> regress - are
>
>  (a) made up microbenchmark loads
>
>  (b) ridiculously many threads
>
> and I think you should take some of what they say with a big pinch of
> salt.
>
> Those "20% difference" numbers aren't actually *real*, is what I'm
> saying.

Agreed that it won't be a 20% difference on real
workloads, but there are a few real world workloads
where these optimizations do make a fairly significant
difference.

For example, the change below made a 2% performance
difference for a memcache-style workload on 2-socket
systems back in 2018, when CPU counts were much smaller
than today:

e9d8c6155768 ("x86/mm/tlb: Skip atomic operations for 'init_mm' in
switch_mm_irqs_off()")

>
> > 2) Instead, at context switch time just update
> >    per_cpu variables like cpu_tlbstate.loaded_mm
> >    and friends
>
> See above. I think you'll still want to limit the actual real
> situation of "look, ma, I'm a single-threaded compiler".
>
> > 3) At (much rarer) TLB flush time:
> >    - Iterate over all CPUs
>
> Change this to "iterate over mm_cpumask", and I think it will work a
> whole lot better.
>
> Because yes, clearly with just the *pure* lazy mm_cpumask, you won
> some at scheduling time, but you lost a *lot* by just forcing
> pointless stale IPIs instead.

I struggle to think of a way to synchronize clearing
bits from the mm_cpumask that does not involve IPIs,
but I suppose we could rate limit that clearing to
something like once a second?

The rest of the time we could compare whether a
CPU's cpu_tlbstate.loaded_mm matches the target mm,
and skip sending an IPI to that CPU?

We already seem to be passing info through to
tlb_is_not_lazy, so the logic could all be implemented
inside there if we wanted to.
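
A rough user-space sketch of that idea, not kernel code: the
mm_cpumask is only an upper bound, a flush skips CPUs whose
loaded mm does not match, and stale bits are cleared at most
once per trim interval. Everything here is a hypothetical
stand-in (struct mm, loaded_mm[], switch_mm_sim(),
flush_targets(), the 64-CPU limit), and it is single threaded,
so it sidesteps the synchronization question raised above. It
also does not model lazy TLB mode or the tlb_gen catch-up a
real CPU would need when switching back to the mm.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 64

/* Hypothetical stand-in for mm_struct: a CPU bitmap plus trim timestamp. */
struct mm {
	uint64_t cpumask;	/* CPUs that *might* hold TLB entries */
	uint64_t last_trim;	/* time of the last stale-bit cleanup */
};

/* Stand-in for the per-CPU cpu_tlbstate.loaded_mm. */
static const struct mm *loaded_mm[NR_CPUS];

/* Context switch: record the new mm and set our bit; never clear bits. */
static void switch_mm_sim(int cpu, struct mm *next)
{
	loaded_mm[cpu] = next;
	next->cpumask |= UINT64_C(1) << cpu;
}

/*
 * TLB flush: walk only the CPUs in the mask and return the set that
 * would get an IPI. CPUs running some other mm are skipped, and their
 * stale bits are cleared only if trim_interval has elapsed.
 */
static uint64_t flush_targets(struct mm *mm, uint64_t now,
			      uint64_t trim_interval)
{
	bool trim = now - mm->last_trim >= trim_interval;
	uint64_t targets = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		uint64_t bit = UINT64_C(1) << cpu;

		if (!(mm->cpumask & bit))
			continue;		/* never ran here */
		if (loaded_mm[cpu] == mm)
			targets |= bit;		/* needs the IPI */
		else if (trim)
			mm->cpumask &= ~bit;	/* rate-limited cleanup */
	}
	if (trim)
		mm->last_trim = now;
	return targets;
}
```

With one mm switched in on CPUs 0 and 1, and CPU 1 then moving
to a different mm, a flush targets only CPU 0; the bit for
CPU 1 stays set until the next trim.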

--
All Rights Reversed.