Re: [tip:x86/mm] [x86/mm/tlb] 209954cbc7: will-it-scale.per_thread_ops 13.2% regression

From: Mathieu Desnoyers
Date: Thu Nov 28 2024 - 14:47:09 EST


On 28-Nov-2024 10:57:35 PM, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a 13.2% regression of will-it-scale.per_thread_ops on:
>
>
> commit: 209954cbc7d0ce1a190fc725d20ce303d74d2680 ("x86/mm/tlb: Update mm_cpumask lazily")
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git x86/mm

AFAIU, this commit changes the way TLB flushes are inhibited when
context switching away from an mm. One additional TLB flush is still
sent to a given CPU even after it has context-switched away from the
mm, and only on receipt of that flush is the CPU's bit cleared from
mm_cpumask.

This can result in additional TLB flush IPI overhead in scenarios
where the IPIs are typically triggered after a thread has
context-switched out.
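
For reference, here is my (simplified) reading of the new behavior,
sketched as kernel-style C. flush_tlb_func_sketch is an invented name
and this is not the actual patch code, just the shape of the scheme as
I understand it:

static void flush_tlb_func_sketch(struct flush_tlb_info *info)
{
        /*
         * The context-switch path no longer clears this CPU's bit in
         * mm_cpumask(). The next flush targeting that mm notices the
         * CPU has already switched away, clears the bit lazily, and
         * skips the actual flush.
         */
        if (this_cpu_read(cpu_tlbstate.loaded_mm) != info->mm) {
                cpumask_clear_cpu(smp_processor_id(),
                                  mm_cpumask(info->mm));
                return;
        }

        flush_tlb_local();      /* CPU still runs this mm: flush for real */
}

The IPIs sent to CPUs which have already switched away, and which now
merely clear their stale bit, are the extra overhead described above.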

May I recommend looking into a scheme similar to rseq mm_cid for this?
We're already adding per-mm per-CPU data there:

mm_struct:
        /**
         * @pcpu_cid: Per-cpu current cid.
         *
         * Keep track of the currently allocated mm_cid for each cpu.
         * The per-cpu mm_cid values are serialized by their respective
         * runqueue locks.
         */
        struct mm_cid __percpu *pcpu_cid;

struct mm_cid {
        u64 time;
        int cid;
        int recent_cid;
};

I suspect you could use a similar per-CPU data structure, per mm, to
keep track of the pending TLB flush mask, updating it with simple
loads/stores to per-CPU data rather than having to bounce cache lines
all over the place due to frequent mm_cpumask atomic updates.
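
To illustrate, here is a minimal sketch of what I have in mind. All the
names (mm_tlb_cpu, pcpu_tlb, build_flush_targets, mm_tlb_set_loaded)
are invented for the example; the per-CPU pointer would live in
mm_struct next to pcpu_cid:

struct mm_tlb_cpu {
        int mm_loaded;  /* nonzero while this CPU has the mm loaded */
};

/* Hypothetical new mm_struct field: struct mm_tlb_cpu __percpu *pcpu_tlb; */

/* Context-switch side: each CPU writes only its own cache line. */
static inline void mm_tlb_set_loaded(struct mm_tlb_cpu __percpu *pcpu_tlb,
                                     int loaded)
{
        WRITE_ONCE(this_cpu_ptr(pcpu_tlb)->mm_loaded, loaded);
}

/* Flush-sender side: build the IPI target set from per-CPU state. */
static void build_flush_targets(struct mm_tlb_cpu __percpu *pcpu_tlb,
                                struct cpumask *targets)
{
        int cpu;

        cpumask_clear(targets);
        for_each_possible_cpu(cpu) {
                if (READ_ONCE(per_cpu_ptr(pcpu_tlb, cpu)->mm_loaded))
                        cpumask_set_cpu(cpu, targets);
        }
}

The context-switch path only ever stores to its own CPU's entry, so the
cache line stays local; the flush sender pays for a walk over the
possible CPUs instead.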

Then you get all the benefits without introducing a window where useless
TLB flush IPIs get triggered.

Of course this is slightly less compact in terms of memory footprint
than a cpumask, but you gain a lot by removing cache line bouncing on
this frequent context-switch code path.

Thoughts?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com