Re: [PATCH v2 1/1] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads

From: Marco Elver
Date: Thu Sep 12 2024 - 12:39:14 EST


On Mon, 9 Sept 2024 at 23:15, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
> introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
> a reference to the concurrency id allocated for each CPU. This reference
> expires shortly after a 100ms delay.
>
> These per-CPU references keep the per-mm-cid data cache-local in
> situations where threads are running at least once on each CPU within
> each 100ms window, thus keeping the per-cpu reference alive.

One orthogonal idea that I recall: If a thread from a different thread
group (i.e. another process) was scheduled on that CPU, the CID can
also be invalidated because the caches are likely polluted. Fixed
values like 100ms seem rather arbitrary and it may work for one system
but not another.

> However, intermittent workloads behaving in bursts spaced by more than
> 100ms on each CPU exhibit bad cache locality and degraded performance
> compared to purely per-cpu data indexing, because concurrency IDs are
> allocated over various CPUs and cores, therefore losing cache locality
> of the associated data.
>
> Introduce the following changes to improve per-mm-cid cache locality:
>
> - Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
> track of which mm_cid value was last used, and use it as a hint to
> attempt re-allocating the same concurrency ID the next time this
> mm/cpu needs to allocate a concurrency ID,
>
> - Add a per-mm CPUs allowed mask, which keeps track of the union of
> CPUs allowed for all threads belonging to this mm. This cpumask is
> only set during the lifetime of the mm, never cleared, so it
> represents the union of all the CPUs allowed since the beginning of
> the mm lifetime. (note that the mm_cpumask() is really arch-specific
> and tailored to the TLB flush needs, and is thus _not_ a viable
> approach for this)
>
> - Add a per-mm nr_cpus_allowed to keep track of the weight of the
> per-mm CPUs allowed mask (for fast access),
>
> - Add a per-mm nr_cids_used to keep track of the highest concurrency
> ID allocated for the mm. This is used for expanding the concurrency ID
> allocation within the upper bound defined by:
>
> min(mm->nr_cpus_allowed, mm->mm_users)
>
> When the next unused CID value reaches this threshold, stop trying
> to expand the cid allocation and use the first available cid value
> instead.
>
> Spreading allocation to use all the cid values within the range
>
> [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]
>
> improves cache locality while preserving mm_cid compactness within the
> expected user limits.
>
> - In __mm_cid_try_get, only return cid values within the range
> [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
> prevents allocating cids above the number of allowed cpus in
> rare scenarios where cid allocation races with a concurrent
> remote-clear of the per-mm/cpu cid. This improvement is made
> possible by the addition of the per-mm CPUs allowed mask.
>
> - In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
> t->nr_cpus_allowed. This criterion was really meant to compare
> the number of mm->mm_users to the number of CPUs allowed for the
> entire mm. Therefore, the prior comparison worked fine when all
> threads shared the same CPUs allowed mask, but not so much in
> scenarios where those threads have different masks (e.g. each
> thread pinned to a single CPU). This improvement is made
> possible by the addition of the per-mm CPUs allowed mask.
>
> * Benchmarks
>
> Each thread increments 16kB worth of 8-bit integers in bursts, with
> a configurable delay between each thread's execution. Each thread run
> one after the other (no threads run concurrently). The order of
> thread execution in the sequence is random. The thread execution
> sequence begins again after all threads have executed. The 16kB areas
> are allocated with rseq_mempool and indexed by either cpu_id, mm_cid
> (not cache-local), or cache-local mm_cid. Each thread is pinned to its
> own core.
>
> Testing configurations:
>
> 8-core/1-L3: Use 8 cores within a single L3
> 24-core/24-L3: Use 24 cores, 1 core per L3
> 192-core/24-L3: Use 192 cores (all cores in the system)
> 384-thread/24-L3: Use 384 HW threads (all HW threads in the system)
>
> Intermittent workload delays between threads: 200ms, 10ms.
>
> Hardware:
>
> CPU(s): 384
> On-line CPU(s) list: 0-383
> Vendor ID: AuthenticAMD
> Model name: AMD EPYC 9654 96-Core Processor
> Thread(s) per core: 2
> Core(s) per socket: 96
> Socket(s): 2
> Caches (sum of all):
> L1d: 6 MiB (192 instances)
> L1i: 6 MiB (192 instances)
> L2: 192 MiB (192 instances)
> L3: 768 MiB (24 instances)
>
> Each result is an average of 5 test runs. The cache-local speedup
> is calculated as: (cache-local mm_cid) / (mm_cid).
>
> Intermittent workload delay: 200ms
>
> per-cpu mm_cid cache-local mm_cid cache-local speedup
> (ns) (ns) (ns)
> 8-core/1-L3 1374 19289 1336 14.4x
> 24-core/24-L3 2423 26721 1594 16.7x
> 192-core/24-L3 2291 15826 2153 7.3x
> 384-thread/24-L3 1874 13234 1907 6.9x
>
> Intermittent workload delay: 10ms
>
> per-cpu mm_cid cache-local mm_cid cache-local speedup
> (ns) (ns) (ns)
> 8-core/1-L3 662 756 686 1.1x
> 24-core/24-L3 1378 3648 1035 3.5x
> 192-core/24-L3 1439 10833 1482 7.3x
> 384-thread/24-L3 1503 10570 1556 6.8x
>
> [ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
> patch series with a simpler and more general approach. ]

I like the simpler and more general approach vs. the NUMA-only
approach! Attempting to reallocate the previously assigned CID seems
to go a long way.

However, this doesn't quite do L3-awareness as mentioned in [1], right?
What I can tell is that this patch improves cache locality for threads
scheduled back on the _same CPU_, but not if those threads are
scheduled on a _set of CPUs_ sharing the _same L3_ cache. So if e.g. a
thread is scheduled from CPU2 to CPU3, but those 2 CPUs share the same
L3 cache, that thread will get a completely new CID and is unlikely to
hit in the L3 cache when accessing the per-CPU data.

[1] https://github.com/google/tcmalloc/issues/144#issuecomment-2307739715

Maybe I missed it, or you are planning to add it in future?

In any case, the current patch is definitely an improvement:

Acked-by: Marco Elver <elver@xxxxxxxxxx>

Thanks,
-- Marco