Re: [PATCH v2 1/1] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads

From: Marco Elver
Date: Wed Oct 02 2024 - 05:50:35 EST


On Mon, 30 Sept 2024 at 21:01, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
> introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
> a reference to the concurrency id allocated for each CPU. This reference
> expires shortly after a 100ms delay.
>
> These per-CPU references keep the per-mm-cid data cache-local in
> situations where threads are running at least once on each CPU within
> each 100ms window, thus keeping the per-cpu reference alive.
>
> However, intermittent workloads behaving in bursts spaced by more than
> 100ms on each CPU exhibit bad cache locality and degraded performance
> compared to purely per-cpu data indexing, because concurrency IDs are
> allocated over various CPUs and cores, therefore losing cache locality
> of the associated data.
>
> Introduce the following changes to improve per-mm-cid cache locality:
>
> - Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
>   track of which mm_cid value was last used, and use it as a hint to
>   attempt re-allocating the same concurrency ID the next time this
>   mm/cpu needs to allocate a concurrency ID,
>
> - Add a per-mm CPUs allowed mask, which keeps track of the union of
>   CPUs allowed for all threads belonging to this mm. This cpumask is
>   only set during the lifetime of the mm, never cleared, so it
>   represents the union of all the CPUs allowed since the beginning of
>   the mm lifetime. (note that the mm_cpumask() is really arch-specific
>   and tailored to the TLB flush needs, and is thus _not_ a viable
>   approach for this)
>
> - Add a per-mm nr_cpus_allowed to keep track of the weight of the
>   per-mm CPUs allowed mask (for fast access),
>
> - Add a per-mm nr_cids_used to keep track of the highest concurrency
>   ID allocated for the mm. This is used for expanding the concurrency ID
>   allocation within the upper bound defined by:
>
>     min(mm->nr_cpus_allowed, mm->mm_users)
>
>   When the next unused CID value reaches this threshold, stop trying
>   to expand the cid allocation and use the first available cid value
>   instead.
>
>   Spreading allocation to use all the cid values within the range
>
>     [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]
>
>   improves cache locality while preserving mm_cid compactness within the
>   expected user limits.
>
> - In __mm_cid_try_get, only return cid values within the range
>   [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
>   prevents allocating cids above the number of allowed cpus in
>   rare scenarios where cid allocation races with a concurrent
>   remote-clear of the per-mm/cpu cid. This improvement is made
>   possible by the addition of the per-mm CPUs allowed mask.
>
> - In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
>   t->nr_cpus_allowed. This criterion was really meant to compare
>   the number of mm->mm_users to the number of CPUs allowed for the
>   entire mm. Therefore, the prior comparison worked fine when all
>   threads shared the same CPUs allowed mask, but not so much in
>   scenarios where those threads have different masks (e.g. each
>   thread pinned to a single CPU). This improvement is made
>   possible by the addition of the per-mm CPUs allowed mask.
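
The allocation order described in the list above (reuse the recent cid,
then expand up to min(nr_cpus_allowed, mm_users), then fall back to the
first free cid) can be sketched in plain userspace C. This is only an
illustration under stated assumptions, not the patch's code: struct toy_mm
and every toy_* name are hypothetical, the masks are single 64-bit words,
nr_cids_used is modelled as "next never-used cid value", and nothing here
is atomic or race-safe the way the kernel code has to be.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the per-mm state; NOT the kernel's mm_struct. */
struct toy_mm {
	uint64_t cid_in_use;	/* bit n set => cid n is currently allocated */
	uint64_t cpus_allowed;	/* union of allowed CPUs; bits are only ever set */
	int nr_cpus_allowed;	/* cached weight of cpus_allowed */
	int mm_users;		/* number of threads using this mm */
	int nr_cids_used;	/* assumption: next cid value never handed out */
};

/* Count the bits set in a 64-bit mask. */
static int toy_weight(uint64_t mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Merge a thread's allowed CPUs into the per-mm mask; never clears bits. */
void toy_mm_allow_cpus(struct toy_mm *mm, uint64_t thread_mask)
{
	mm->cpus_allowed |= thread_mask;
	mm->nr_cpus_allowed = toy_weight(mm->cpus_allowed);
}

/* Upper bound for expansion: min(nr_cpus_allowed, mm_users). */
static int toy_cid_limit(const struct toy_mm *mm)
{
	return mm->nr_cpus_allowed < mm->mm_users ?
	       mm->nr_cpus_allowed : mm->mm_users;
}

/* Claim a cid if it is free; returns false if it was already taken. */
static bool toy_try_claim(struct toy_mm *mm, int cid)
{
	uint64_t bit = UINT64_C(1) << cid;

	if (mm->cid_in_use & bit)
		return false;
	mm->cid_in_use |= bit;
	return true;
}

/* Release a previously allocated cid. */
void toy_cid_put(struct toy_mm *mm, int cid)
{
	mm->cid_in_use &= ~(UINT64_C(1) << cid);
}

/*
 * Allocate a cid. recent_cid is the value this mm last used on the
 * current CPU, or -1 if there is none. Returns -1 when nothing is free.
 */
int toy_cid_get(struct toy_mm *mm, int recent_cid)
{
	int limit = toy_cid_limit(mm);
	int cid;

	/* 1. Prefer the cid this mm/cpu used last time (cache-local data). */
	if (recent_cid >= 0 && recent_cid < limit &&
	    toy_try_claim(mm, recent_cid))
		return recent_cid;

	/* 2. Expand into never-used cid values while below the bound. */
	while (mm->nr_cids_used < limit) {
		cid = mm->nr_cids_used++;
		if (toy_try_claim(mm, cid))
			return cid;
	}

	/* 3. Bound reached: take the first free cid below nr_cpus_allowed. */
	for (cid = 0; cid < mm->nr_cpus_allowed; cid++)
		if (toy_try_claim(mm, cid))
			return cid;
	return -1;
}
```

With four threads and four allowed CPUs, releasing cid 0 and then
allocating without a hint hands out cid 2 rather than 0, so cid 0 stays
available to the CPU that used it last. That is the spreading behaviour
the description relies on.
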
>
> * Benchmarks
>
> Each thread increments 16kB worth of 8-bit integers in bursts, with
> a configurable delay between each thread's execution. The threads run
> one after the other (no threads run concurrently). The order of
> thread execution in the sequence is random. The thread execution
> sequence begins again after all threads have executed. The 16kB areas
> are allocated with rseq_mempool and indexed by either cpu_id, mm_cid
> (not cache-local), or cache-local mm_cid. Each thread is pinned to its
> own core.
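
For readers who want a concrete picture of one burst, here is a minimal
sketch of the per-thread work described above. It is an assumption-laden
stand-in rather than the benchmark's source: the areas are a plain
contiguous array instead of an rseq_mempool allocation, and AREA_SIZE and
burst_increment() are hypothetical names. The index argument would be
cpu_id or mm_cid depending on the configuration under test.

```c
#include <stddef.h>
#include <stdint.h>

#define AREA_SIZE (16 * 1024)	/* 16kB of 8-bit counters per index */

/* Increment every 8-bit counter in the area selected by index. */
void burst_increment(uint8_t *areas, size_t index)
{
	uint8_t *area = areas + index * AREA_SIZE;

	for (size_t i = 0; i < AREA_SIZE; i++)
		area[i]++;
}
```

Whether the area touched here is still warm in the local cache depends on
whether the same index comes back on the same CPU, which is what the three
indexing schemes differ on.
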
>
> Testing configurations:
>
> 8-core/1-L3: Use 8 cores within a single L3
> 24-core/24-L3: Use 24 cores, 1 core per L3
> 192-core/24-L3: Use 192 cores (all cores in the system)
> 384-thread/24-L3: Use 384 HW threads (all HW threads in the system)
>
> Intermittent workload delays between threads: 200ms, 10ms.
>
> Hardware:
>
> CPU(s):                 384
> On-line CPU(s) list:    0-383
> Vendor ID:              AuthenticAMD
> Model name:             AMD EPYC 9654 96-Core Processor
> Thread(s) per core:     2
> Core(s) per socket:     96
> Socket(s):              2
> Caches (sum of all):
>   L1d:                  6 MiB (192 instances)
>   L1i:                  6 MiB (192 instances)
>   L2:                   192 MiB (192 instances)
>   L3:                   768 MiB (24 instances)
>
> Each result is an average of 5 test runs. The cache-local speedup
> is calculated as: (mm_cid) / (cache-local mm_cid).
>
> Intermittent workload delay: 200ms
>
>                     per-cpu    mm_cid    cache-local mm_cid    cache-local speedup
>                        (ns)      (ns)                  (ns)
> 8-core/1-L3            1374     19289                  1336                  14.4x
> 24-core/24-L3          2423     26721                  1594                  16.7x
> 192-core/24-L3         2291     15826                  2153                   7.3x
> 384-thread/24-L3       1874     13234                  1907                   6.9x
>
> Intermittent workload delay: 10ms
>
>                     per-cpu    mm_cid    cache-local mm_cid    cache-local speedup
>                        (ns)      (ns)                  (ns)
> 8-core/1-L3             662       756                   686                   1.1x
> 24-core/24-L3          1378      3648                  1035                   3.5x
> 192-core/24-L3         1439     10833                  1482                   7.3x
> 384-thread/24-L3       1503     10570                  1556                   6.8x
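
As a quick check on the speedup column, the figures in both tables are
consistent with dividing the plain mm_cid time by the cache-local mm_cid
time. The helper below only restates that arithmetic;
cache_local_speedup() is a hypothetical name, not part of the patch or
the benchmark.

```c
/* Speedup of cache-local mm_cid indexing relative to plain mm_cid. */
double cache_local_speedup(double mm_cid_ns, double cache_local_ns)
{
	return mm_cid_ns / cache_local_ns;
}
```

For the 8-core/1-L3 rows this gives 19289 / 1336, roughly 14.4, at 200ms
and 756 / 686, roughly 1.1, at 10ms, matching the tables.
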
>
> [ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
> patch series with a simpler and more general approach. ]
>
> Link: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@xxxxxxxxxxxx/
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> Acked-by: Marco Elver <elver@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> Cc: Ben Segall <bsegall@xxxxxxxxxx>
> Cc: Dmitry Vyukov <dvyukov@xxxxxxxxxx>
> Cc: Marco Elver <elver@xxxxxxxxxx>
> Cc: Yury Norov <yury.norov@xxxxxxxxx>
> Cc: Rasmus Villemoes <linux@xxxxxxxxxxxxxxxxxx>
> ---
> Changes since v0:
> - On migration, do not move the source cid to the destination cpu if the
> destination cpu has a recent cid value set.
>
> Changes since v2:
> - Rebase on v6.11.1.

I think the versioning and changelog got confused: the changes from [1],
which was already v2, are included in this one.

[1] https://lore.kernel.org/all/5cf2c0a5-7a99-4294-b316-eee07896ddf6@xxxxxxxxxxxx/T/#u

In any case, I'll reiterate my Ack as this looks like an improvement
for the common case.

Acked-by: Marco Elver <elver@xxxxxxxxxx>

Thanks,
-- Marco
