Re: [PATCH v2 1/1] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads

From: Mathieu Desnoyers
Date: Wed Oct 02 2024 - 08:47:35 EST


On 2024-10-02 11:49, Marco Elver wrote:
On Mon, 30 Sept 2024 at 21:01, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:

commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
a reference to the concurrency id allocated for each CPU. This reference
expires shortly after a 100ms delay.

These per-CPU references keep the per-mm-cid data cache-local in
situations where threads are running at least once on each CPU within
each 100ms window, thus keeping the per-cpu reference alive.

However, intermittent workloads behaving in bursts spaced by more than
100ms on each CPU exhibit bad cache locality and degraded performance
compared to purely per-cpu data indexing, because concurrency IDs are
allocated over various CPUs and cores, therefore losing cache locality
of the associated data.

Introduce the following changes to improve per-mm-cid cache locality:

- Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
track of which mm_cid value was last used, and use it as a hint to
attempt re-allocating the same concurrency ID the next time this
mm/cpu needs to allocate a concurrency ID,

- Add a per-mm CPUs allowed mask, which keeps track of the union of
CPUs allowed for all threads belonging to this mm. This cpumask is
only set during the lifetime of the mm, never cleared, so it
represents the union of all the CPUs allowed since the beginning of
the mm lifetime. (note that the mm_cpumask() is really arch-specific
and tailored to the TLB flush needs, and is thus _not_ a viable
approach for this)

- Add a per-mm nr_cpus_allowed to keep track of the weight of the
per-mm CPUs allowed mask (for fast access),

- Add a per-mm nr_cids_used to keep track of the highest concurrency
ID allocated for the mm. This is used for expanding the concurrency ID
allocation within the upper bound defined by:

min(mm->nr_cpus_allowed, mm->mm_users)

When the next unused CID value reaches this threshold, stop trying
to expand the cid allocation and use the first available cid value
instead.

Spreading allocation to use all the cid values within the range

[ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]

improves cache locality while preserving mm_cid compactness within the
expected user limits.

- In __mm_cid_try_get, only return cid values within the range
[ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
prevents allocating cids above the number of allowed cpus in
rare scenarios where cid allocation races with a concurrent
remote-clear of the per-mm/cpu cid. This improvement is made
possible by the addition of the per-mm CPUs allowed mask.

- In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
t->nr_cpus_allowed. This criterion was really meant to compare
the number of mm->mm_users to the number of CPUs allowed for the
entire mm. Therefore, the prior comparison worked fine when all
threads shared the same CPUs allowed mask, but not so much in
scenarios where those threads have different masks (e.g. each
thread pinned to a single CPU). This improvement is made
possible by the addition of the per-mm CPUs allowed mask.

* Benchmarks

Each thread increments 16kB worth of 8-bit integers in bursts, with
a configurable delay between each thread's execution. Each thread run
one after the other (no threads run concurrently). The order of
thread execution in the sequence is random. The thread execution
sequence begins again after all threads have executed. The 16kB areas
are allocated with rseq_mempool and indexed by either cpu_id, mm_cid
(not cache-local), or cache-local mm_cid. Each thread is pinned to its
own core.

Testing configurations:

8-core/1-L3: Use 8 cores within a single L3
24-core/24-L3: Use 24 cores, 1 core per L3
192-core/24-L3: Use 192 cores (all cores in the system)
384-thread/24-L3: Use 384 HW threads (all HW threads in the system)

Intermittent workload delays between threads: 200ms, 10ms.

Hardware:

CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Caches (sum of all):
L1d: 6 MiB (192 instances)
L1i: 6 MiB (192 instances)
L2: 192 MiB (192 instances)
L3: 768 MiB (24 instances)

Each result is an average of 5 test runs. The cache-local speedup
is calculated as: (cache-local mm_cid) / (mm_cid).

Intermittent workload delay: 200ms

per-cpu mm_cid cache-local mm_cid cache-local speedup
(ns) (ns) (ns)
8-core/1-L3 1374 19289 1336 14.4x
24-core/24-L3 2423 26721 1594 16.7x
192-core/24-L3 2291 15826 2153 7.3x
384-thread/24-L3 1874 13234 1907 6.9x

Intermittent workload delay: 10ms

per-cpu mm_cid cache-local mm_cid cache-local speedup
(ns) (ns) (ns)
8-core/1-L3 662 756 686 1.1x
24-core/24-L3 1378 3648 1035 3.5x
192-core/24-L3 1439 10833 1482 7.3x
384-thread/24-L3 1503 10570 1556 6.8x

[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
patch series with a simpler and more general approach. ]

Link: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@xxxxxxxxxxxx/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Acked-by: Marco Elver <elver@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
Cc: Ben Segall <bsegall@xxxxxxxxxx>
Cc: Dmitry Vyukov <dvyukov@xxxxxxxxxx>
Cc: Marco Elver <elver@xxxxxxxxxx>
Cc: Yury Norov <yury.norov@xxxxxxxxx>
Cc: Rasmus Villemoes <linux@xxxxxxxxxxxxxxxxxx>
---
Changes since v0:
- On migration, do not move the source cid to the destination cpu if the
destination cpu has a recent cid value set.

Changes since v2:
- Rebase on v6.11.1.

I think the versioning and changelog got confused. I see the changes
from [1] which was already v2 are included in this one.

[1] https://lore.kernel.org/all/5cf2c0a5-7a99-4294-b316-eee07896ddf6@xxxxxxxxxxxx/T/#u

Which means I should have tagged this series [PATCH v3]. Sorry about
that.


In any case, I'll reiterate my Ack as this looks like an improvement
for the common case.

Acked-by: Marco Elver <elver@xxxxxxxxxx>

Thanks!

Peter, should I re-send as is with a v3 tag, or is it OK for merge ?

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com