On Mon, 9 Sept 2024 at 23:15, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
> introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
> a reference to the concurrency id allocated for each CPU. This reference
> expires shortly after a 100ms delay.
>
> These per-CPU references keep the per-mm-cid data cache-local in
> situations where threads are running at least once on each CPU within
> each 100ms window, thus keeping the per-cpu reference alive.
One orthogonal idea that I recall: if a thread from a different thread
group (i.e. another process) was scheduled on that CPU, the CID could
also be invalidated, because the caches are likely polluted by then.
Fixed values like 100ms seem rather arbitrary: they may work for one
system but not for another.
I like the simpler and more general approach vs. the NUMA-only
approach! Attempting to reallocate the previously assigned CID seems
to go a long way.
However, this doesn't quite do L3-awareness as mentioned in [1], right?
As far as I can tell, this patch improves cache locality for threads
scheduled back on the _same CPU_, but not for threads scheduled on a
_set of CPUs_ sharing the _same L3_ cache. So if e.g. a thread moves
from CPU2 to CPU3, and those 2 CPUs share the same L3 cache, that
thread will still get a completely new CID and is unlikely to hit in
the L3 cache when accessing the per-CPU data.
[1] https://github.com/google/tcmalloc/issues/144#issuecomment-2307739715
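To make the CPU2 -> CPU3 case concrete, here is a minimal userspace
sketch in C. It is not the kernel implementation: struct mm_sketch,
cpu_to_l3[], cid_get_per_cpu() and cid_get_l3_aware() are hypothetical
names made up for illustration, and the 4-CPU/2-L3 topology is an
assumption. It only contrasts "reuse the CID last used on this CPU"
with "also try CIDs last used on CPUs sharing this L3".

```c
#include <stdbool.h>

#define NR_CPUS   4
#define NR_CIDS   4
#define CID_UNSET (-1)

/* Assumed topology: CPUs 0-1 share one L3, CPUs 2-3 share another. */
static const int cpu_to_l3[NR_CPUS] = { 0, 0, 1, 1 };

/* Hypothetical per-mm state, loosely modelled on the description above. */
struct mm_sketch {
	bool cid_used[NR_CIDS];   /* CID currently held by a running thread */
	int  recent_cid[NR_CPUS]; /* CID last used on each CPU, or CID_UNSET */
};

static void mm_sketch_init(struct mm_sketch *mm)
{
	for (int i = 0; i < NR_CIDS; i++)
		mm->cid_used[i] = false;
	for (int i = 0; i < NR_CPUS; i++)
		mm->recent_cid[i] = CID_UNSET;
}

/* Lowest free CID, or CID_UNSET if all are in use. */
static int first_free_cid(const struct mm_sketch *mm)
{
	for (int i = 0; i < NR_CIDS; i++)
		if (!mm->cid_used[i])
			return i;
	return CID_UNSET;
}

/* Mark the CID as held and remember it as this CPU's most recent CID. */
static int cid_claim(struct mm_sketch *mm, int cpu, int cid)
{
	if (cid != CID_UNSET) {
		mm->cid_used[cid] = true;
		mm->recent_cid[cpu] = cid;
	}
	return cid;
}

/* Per-CPU policy: only the CID last used on this same CPU is preferred. */
static int cid_get_per_cpu(struct mm_sketch *mm, int cpu)
{
	int cid = mm->recent_cid[cpu];

	if (cid == CID_UNSET || mm->cid_used[cid])
		cid = first_free_cid(mm);
	return cid_claim(mm, cpu, cid);
}

/* L3-aware policy: before falling back to any free CID, try a free CID
 * that was last used on another CPU sharing this CPU's L3. */
static int cid_get_l3_aware(struct mm_sketch *mm, int cpu)
{
	int cid = mm->recent_cid[cpu];

	if (cid == CID_UNSET || mm->cid_used[cid]) {
		cid = CID_UNSET;
		for (int sib = 0; sib < NR_CPUS; sib++) {
			int c = mm->recent_cid[sib];

			if (sib != cpu && cpu_to_l3[sib] == cpu_to_l3[cpu] &&
			    c != CID_UNSET && !mm->cid_used[c]) {
				cid = c;
				break;
			}
		}
		if (cid == CID_UNSET)
			cid = first_free_cid(mm);
	}
	return cid_claim(mm, cpu, cid);
}

/* Release a CID when its thread is scheduled out. */
static void cid_put(struct mm_sketch *mm, int cid)
{
	mm->cid_used[cid] = false;
}
```

In this sketch, if one thread ran on CPU0 and another on CPU2, and the
latter is then scheduled on CPU3, the per-CPU policy hands it whatever
CID is free first (here the one whose data was last touched from CPU0's
L3), whereas the L3-aware variant hands back the CID last used on CPU2.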
Maybe I missed it, or are you planning to add it in the future?
In any case, the current patch is definitely an improvement:
Acked-by: Marco Elver <elver@xxxxxxxxxx>
Thanks,
-- Marco