Re: [PATCH v2 1/1] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads

From: Marco Elver
Date: Fri Sep 13 2024 - 08:10:13 EST


On Thu, 12 Sept 2024 at 19:34, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> On 2024-09-12 12:38, Marco Elver wrote:
> > On Mon, 9 Sept 2024 at 23:15, Mathieu Desnoyers
> > <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> >>
> >> commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
> >> introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
> >> a reference to the concurrency id allocated for each CPU. This reference
> >> expires shortly after a 100ms delay.
> >>
> >> These per-CPU references keep the per-mm-cid data cache-local in
> >> situations where threads are running at least once on each CPU within
> >> each 100ms window, thus keeping the per-cpu reference alive.
> >
> > One orthogonal idea that I recall: If a thread from a different thread
> > group (i.e. another process) was scheduled on that CPU, the CID can
> > also be invalidated because the caches are likely polluted. Fixed
> > values like 100ms seem rather arbitrary and it may work for one system
> > but not another.
>
> That depends on the cache usage pattern of the different thread group:
> it's also possible that the other thread group does not perform that
> many stores to memory before the original thread group is scheduled
> back, thus keeping the cache content untouched.
>
> The ideal metric there would probably be based on PMU counters, but
> I doubt we want to go there.
>
> [...]
> >
> > I like the simpler and more general approach vs. the NUMA-only
> > approach! Attempting to reallocate the previously assigned CID seems
> > to go a long way.
>
> Indeed it does!
>
> >
> > However, this doesn't quite do L3-awareness as mentioned in [1], right?
> > What I can tell is that this patch improves cache locality for threads
> > scheduled back on the _same CPU_, but not if those threads are
> > scheduled on a _set of CPUs_ sharing the _same L3_ cache. So if e.g. a
> > thread is scheduled from CPU2 to CPU3, but those 2 CPUs share the same
> > L3 cache, that thread will get a completely new CID and is unlikely to
> > hit in the L3 cache when accessing the per-CPU data.
> >
> > [1] https://github.com/google/tcmalloc/issues/144#issuecomment-2307739715
> >
> > Maybe I missed it, or you are planning to add it in future?
>
> In my benchmarks, I noticed that preserving cache-locality at the L1 and
> L2 levels was important as well.
>
> I would like to understand better the use-case you refer to for L3
> locality. AFAIU, this implies a scenario where the scheduler migrates
> a thread from CPU 2 to CPU 3 (both with the same L3), and you would
> like to migrate the concurrency ID along.

Either migrate it along, _or_ pick a CID from a different thread that
ran on a CPU that shares this L3. E.g. if T1 is migrated from CPU2 to
CPU3, and T2 ran on CPU3 before, then it would be ok for T1 to get its
previous CID or T2's CID from when it ran on CPU3. Or more simply,
CIDs aren't tied to particular threads, but to a subset of CPUs based
on topology. It would be nice if the user could specify that
topology / CID affinity.
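
To make the CPU2 -> CPU3 example concrete, here is a rough userspace
sketch, not kernel code. The names (cpu_l3, last_cid_on_cpu,
cid_ok_for_cpu) and the 8-CPU / two-L3 layout are made up for
illustration; the only point is that a CID is acceptable on a CPU if it
was last used somewhere under the same L3:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS  8
#define CID_NONE (-1)

/* Assumed topology: CPUs 0-3 share one L3, CPUs 4-7 share another. */
static const int cpu_l3[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Hypothetical per-mm state: the CID most recently used on each CPU. */
static int last_cid_on_cpu[NR_CPUS] = {
	CID_NONE, CID_NONE, CID_NONE, CID_NONE,
	CID_NONE, CID_NONE, CID_NONE, CID_NONE,
};

/*
 * A CID is a cache-friendly choice for @cpu if it was last used on any
 * CPU in the same L3 domain, regardless of which thread used it.
 */
static bool cid_ok_for_cpu(int cid, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);

	if (cid == CID_NONE)
		return false;

	for (int i = 0; i < NR_CPUS; i++) {
		if (last_cid_on_cpu[i] == cid && cpu_l3[i] == cpu_l3[cpu])
			return true;
	}
	return false;
}
```

With T1's CID recorded on CPU2 and T2's CID recorded on CPU3, both
would pass cid_ok_for_cpu() for CPU3, while neither would pass for a
CPU in the other L3 domain.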

> When the number of threads is < number of mm allowed cpus, the
> migrate hooks steal the concurrency ID from CPU 2 and moves it to
> CPU 3 if there is only a single thread from that mm on CPU 2, which
> does what you wish.

Only if the next CPU shares the cache. What if it moves the thread to
a CPU whose L3 cache != the previous CPU's L3 cache? In that case,
it'd be preferable to pick a last-used CID from the set of CPUs that
are grouped under that L3 cache.
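
Roughly something like the following, again only as a userspace model
and not the actual mm_cid code. struct mm_cid_model, pick_cid() and the
fallback order (same CPU, then same L3, then lowest free CID) are my
assumptions for illustration:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS  8
#define NR_CIDS  8
#define CID_NONE (-1)

/* Hypothetical per-mm bookkeeping. */
struct mm_cid_model {
	int  cpu_l3[NR_CPUS];      /* L3 domain id of each CPU */
	int  last_cid[NR_CPUS];    /* CID last used on each CPU, or CID_NONE */
	bool cid_in_use[NR_CIDS];  /* CID currently held by a running thread */
};

/* Pick a CID for a thread that is about to run on @dst_cpu. */
static int pick_cid(const struct mm_cid_model *m, int dst_cpu)
{
	int cid;

	assert(dst_cpu >= 0 && dst_cpu < NR_CPUS);

	/* 1) The CID left in place on the destination CPU (L1/L2 locality). */
	cid = m->last_cid[dst_cpu];
	if (cid != CID_NONE && !m->cid_in_use[cid])
		return cid;

	/* 2) A free CID last used on any CPU under the same L3. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (m->cpu_l3[cpu] != m->cpu_l3[dst_cpu])
			continue;
		cid = m->last_cid[cpu];
		if (cid != CID_NONE && !m->cid_in_use[cid])
			return cid;
	}

	/* 3) No locality to preserve: fall back to the lowest free CID. */
	for (cid = 0; cid < NR_CIDS; cid++) {
		if (!m->cid_in_use[cid])
			return cid;
	}
	return CID_NONE;
}
```

Step 2 is the part that the current patch does not do as far as I can
tell: a thread arriving on a CPU with no CID of its own would reuse one
from a sibling under the same L3 instead of getting an unrelated one.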

> When the number of threads is >= number of mm allowed cpus, the
> migrate hook is skipped, and the concurrency ID from CPU 2 is
> left in place, favoring cache locality at L1/L2 levels.

... and any higher level caches, too, I'd assume.

> In that
> case it's the scheduler's decision to migrate the thread from
> CPU 2 to CPU 3, so I would think improving the scheduler decisions
> about migration and minimizing thread movement would be more
> relevant than trying to optimize concurrency ID movement.

From what I gather, if the CID is left in place on a CPU and the next
thread just grabs it, that's already optimal.

Thanks,
-- Marco