Re: [PATCH v2 1/1] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads

From: Mathieu Desnoyers
Date: Thu Sep 12 2024 - 13:34:52 EST


On 2024-09-12 12:38, Marco Elver wrote:
On Mon, 9 Sept 2024 at 23:15, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:

commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
a reference to the concurrency id allocated for each CPU. This reference
expires shortly after a 100ms delay.

These per-CPU references keep the per-mm-cid data cache-local in
situations where threads are running at least once on each CPU within
each 100ms window, thus keeping the per-cpu reference alive.
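
For readers following along, the expiry behaviour described above can be
sketched in plain user-space C. This is only an illustration under stated
assumptions: struct pcpu_cid, pcpu_cid_touch() and pcpu_cid_expire() are
hypothetical names, not the kernel's, and the real code does its scan from
deferred work rather than through a direct call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CID_UNSET   (-1)
#define CID_TTL_MS  100   /* the 100ms window mentioned in the text */

/* Hypothetical per-mm, per-CPU slot remembering the cid last used on a CPU. */
struct pcpu_cid {
	int      cid;      /* CID_UNSET when no reference is held */
	uint64_t last_ms;  /* last time a thread of this mm ran on the CPU */
};

/* A thread of the mm ran on this CPU at now_ms: refresh the reference. */
static void pcpu_cid_touch(struct pcpu_cid *pc, int cid, uint64_t now_ms)
{
	assert(cid >= 0);
	pc->cid = cid;
	pc->last_ms = now_ms;
}

/*
 * Periodic scan (assumed): drop the reference once the mm has not run on
 * this CPU for more than CID_TTL_MS. Returns true if it was dropped.
 */
static bool pcpu_cid_expire(struct pcpu_cid *pc, uint64_t now_ms)
{
	if (pc->cid == CID_UNSET)
		return false;
	if (now_ms - pc->last_ms <= CID_TTL_MS)
		return false;
	pc->cid = CID_UNSET;
	return true;
}
```

As long as pcpu_cid_touch() is called at least once per window, the
reference survives every scan; an intermittent workload that sleeps longer
than the window loses it.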

One orthogonal idea that I recall: If a thread from a different thread
group (i.e. another process) was scheduled on that CPU, the CID can
also be invalidated because the caches are likely polluted. Fixed
values like 100ms seem rather arbitrary and it may work for one system
but not another.

That depends on the cache usage pattern of the different thread group:
it's also possible that the other thread group does not perform that
many stores to memory before the original thread group is scheduled
back, thus keeping the cache content untouched.

The ideal metric there would probably be based on PMU counters, but
I doubt we want to go there.

[...]

I like the simpler and more general approach vs. the NUMA-only
approach! Attempting to reallocate the previously assigned CID seems
to go a long way.
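
A rough user-space sketch of "try the previously assigned CID first" might
look like the following. Everything here is an assumption made for
illustration: struct mm_sketch, the boolean in-use table, the per-CPU
recent_cid array and the lowest-free fallback are hypothetical and are not
taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 8          /* hypothetical: one possible cid per CPU */
#define CID_UNSET      (-1)

struct mm_sketch {
	bool in_use[NR_CPUS_SKETCH];      /* which cids are currently held */
	int  recent_cid[NR_CPUS_SKETCH];  /* last cid used on each CPU */
};

static void mm_sketch_init(struct mm_sketch *mm)
{
	for (int i = 0; i < NR_CPUS_SKETCH; i++) {
		mm->in_use[i] = false;
		mm->recent_cid[i] = CID_UNSET;
	}
}

/* Prefer the cid this CPU used last; otherwise take the lowest free one. */
static int cid_alloc(struct mm_sketch *mm, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	int cid = mm->recent_cid[cpu];

	if (cid == CID_UNSET || mm->in_use[cid]) {
		cid = CID_UNSET;
		for (int i = 0; i < NR_CPUS_SKETCH; i++) {
			if (!mm->in_use[i]) {
				cid = i;
				break;
			}
		}
	}
	if (cid != CID_UNSET) {
		mm->in_use[cid] = true;
		mm->recent_cid[cpu] = cid;   /* remember it for next time */
	}
	return cid;
}

/* Release the cid but keep the per-CPU hint so it can be reclaimed later. */
static void cid_free(struct mm_sketch *mm, int cpu, int cid)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	assert(cid >= 0 && cid < NR_CPUS_SKETCH);
	mm->in_use[cid] = false;
	mm->recent_cid[cpu] = cid;
}
```

With a plain lowest-free allocator, whichever CPU comes back first after
an idle period would grab cid 0; with the hint, each CPU tends to get back
the cid whose per-cid data is still warm in its cache.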

Indeed it does!


However, this doesn't quite do L3-awareness as mentioned in [1], right?
From what I can tell, this patch improves cache locality for threads
scheduled back on the _same CPU_, but not if those threads are
scheduled on a _set of CPUs_ sharing the _same L3_ cache. So if e.g. a
thread is scheduled from CPU2 to CPU3, but those 2 CPUs share the same
L3 cache, that thread will get a completely new CID and is unlikely to
hit in the L3 cache when accessing the per-CPU data.

[1] https://github.com/google/tcmalloc/issues/144#issuecomment-2307739715

Maybe I missed it, or you are planning to add it in future?

In my benchmarks, I noticed that preserving cache-locality at the L1 and
L2 levels was important as well.

I would like to understand better the use-case you refer to for L3
locality. AFAIU, this implies a scenario where the scheduler migrates
a thread from CPU 2 to CPU 3 (both with the same L3), and you would
like to migrate the concurrency ID along.

When the number of threads is < number of mm allowed cpus, the
migrate hook steals the concurrency ID from CPU 2 and moves it to
CPU 3 if there is only a single thread from that mm on CPU 2, which
does what you wish.

When the number of threads is >= number of mm allowed cpus, the
migrate hook is skipped, and the concurrency ID from CPU 2 is
left in place, favoring cache locality at L1/L2 levels. In that
case it's the scheduler's decision to migrate the thread from
CPU 2 to CPU 3, so I would think improving the scheduler decisions
about migration and minimizing thread movement would be more
relevant than trying to optimize concurrency ID movement.
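
To make the two cases above concrete, here is a minimal sketch of the
decision as I described it. It is not the kernel code: cid_on_migrate()
and the enum are hypothetical names, and nr_mm_threads_on_src is assumed
to count the threads of that mm on the source CPU including the one being
migrated.

```c
#include <assert.h>

enum cid_migrate_action {
	CID_MOVE_WITH_THREAD,   /* cid follows the thread to the new CPU */
	CID_LEAVE_IN_PLACE,     /* cid stays on the source CPU */
};

static enum cid_migrate_action
cid_on_migrate(unsigned int nr_threads,
	       unsigned int nr_allowed_cpus,
	       unsigned int nr_mm_threads_on_src)
{
	assert(nr_allowed_cpus > 0);

	/* threads >= allowed cpus: the migrate hook is skipped entirely. */
	if (nr_threads >= nr_allowed_cpus)
		return CID_LEAVE_IN_PLACE;

	/* Only this thread used the cid on the source CPU: take it along. */
	if (nr_mm_threads_on_src == 1)
		return CID_MOVE_WITH_THREAD;

	/* Other threads of the mm still run there and keep using the cid. */
	return CID_LEAVE_IN_PLACE;
}
```

For example, with 4 threads on 8 allowed CPUs and a single thread of the
mm on CPU 2, the cid moves to CPU 3; with 8 or more threads it stays on
CPU 2.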

But I may not be fully understanding your use-case.


In any case, the current patch is definitely an improvement:

Acked-by: Marco Elver <elver@xxxxxxxxxx>

Thanks a lot for your feedback!

Mathieu


Thanks,
-- Marco

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com