On Tue, Jun 20, 2023 at 01:44:32PM +0530, Swapnil Sapkal wrote:
Hello Mathieu,
On 4/22/2023 1:13 PM, tip-bot2 for Mathieu Desnoyers wrote:
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 223baf9d17f25e2608dbdff7232c095c1e612268
Gitweb: https://git.kernel.org/tip/223baf9d17f25e2608dbdff7232c095c1e612268
Author: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
AuthorDate: Thu, 20 Apr 2023 10:55:48 -04:00
Committer: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
CommitterDate: Fri, 21 Apr 2023 13:24:20 +02:00
sched: Fix performance regression introduced by mm_cid
Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
sysbench regression reported by Aaron Lu.
Keep track of the currently allocated mm_cid for each mm/cpu rather than
freeing them immediately on context switch. This eliminates most atomic
operations when context switching back and forth between threads
belonging to different memory spaces in multi-threaded scenarios (many
processes, each with many threads). The per-mm/per-cpu mm_cid values are
serialized by their respective runqueue locks.
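
(Aside for readers following along: the layout described above amounts, roughly, to the toy model below. This is a simplified userspace sketch of the idea only; the names and types are illustrative, not the actual kernel structures.)

/* Toy model of the per-mm/per-cpu concurrency id (cid) layout. */
#include <stdint.h>

#define CID_UNSET	(-1)

struct mm_cid {			/* one slot per cpu, per mm */
	uint64_t time;		/* last time this cid was observed in use */
	int cid;		/* cid owned by this cpu, or CID_UNSET */
};

struct mm_model {
	int nr_cpus;
	struct mm_cid *pcpu_cid;	/* nr_cpus slots; slot i is only touched
					 * with cpu i's runqueue lock held */
	uint64_t cid_mask;		/* bitmap of cids handed out to this mm
					 * (limited to 64 in this toy model) */
};
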
Thread migration is handled by introducing an invocation of
sched_mm_cid_migrate_to() (with destination runqueue lock held) in
activate_task() for migrating tasks. If the destination cpu's mm_cid is
unset, and if the source runqueue is not actively using its mm_cid, then
the source cpu's mm_cid is moved to the destination cpu on migration.
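
(Continuing the toy model from above, the migration rule just described would look roughly like the sketch below; the model_* name is made up for illustration, and the "actively using" check is reduced to a flag here.)

/* Sketch of the migrate-to rule, reusing struct mm_model / struct mm_cid
 * from the earlier sketch. In the real scheduler this runs with the
 * destination runqueue lock held; the toy model has no locking. */
static void model_mm_cid_migrate_to(struct mm_model *mm, int src_cpu,
				    int dst_cpu, int src_cid_actively_used)
{
	struct mm_cid *src = &mm->pcpu_cid[src_cpu];
	struct mm_cid *dst = &mm->pcpu_cid[dst_cpu];

	if (dst->cid != CID_UNSET)
		return;		/* destination already owns a cid, keep it */
	if (src->cid == CID_UNSET || src_cid_actively_used)
		return;		/* nothing to move, or source still needs it */

	dst->cid = src->cid;	/* the cid follows the migrating task */
	src->cid = CID_UNSET;
}
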
Introduce a task-work executed periodically, similarly to NUMA work,
which delays reclaim of cid values when they are unused for a period of
time.
Keep track of the allocation time for each per-cpu cid, and let the task
work clear them when they are observed to be older than
SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all
mm_cids which are greater than or equal to the Hamming weight of the mm
cidmask to keep concurrency ids compact.
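
(The periodic reclaim performed by the task work, again in terms of the toy model above; the period value and helper names here are arbitrary illustrations, not the kernel's.)

/* Sketch of the periodic reclaim done by the task work, using the toy
 * structures from the earlier sketches. */
#define MODEL_CID_PERIOD_NS	(100ULL * 1000 * 1000)	/* arbitrary for the model */

static int model_hweight64(uint64_t x)
{
	int n = 0;

	while (x) {
		n += x & 1;
		x >>= 1;
	}
	return n;
}

static void model_mm_cid_reclaim(struct mm_model *mm, uint64_t now_ns)
{
	int weight = model_hweight64(mm->cid_mask);	/* cids currently handed out */

	for (int cpu = 0; cpu < mm->nr_cpus; cpu++) {
		struct mm_cid *pc = &mm->pcpu_cid[cpu];

		if (pc->cid == CID_UNSET)
			continue;

		/* unused for longer than the reclaim period... */
		int stale = (now_ns - pc->time) > MODEL_CID_PERIOD_NS;
		/* ...or too large to keep the id space compact */
		int noncompact = pc->cid >= weight;

		if (stale || noncompact) {
			mm->cid_mask &= ~(1ULL << pc->cid);
			pc->cid = CID_UNSET;
		}
	}
}
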
Because we want to ensure the mm_cid converges towards the smaller
values as migrations happen, the prior optimization that was done when
context switching between threads belonging to the same mm is removed,
because it could delay the lazy release of the destination runqueue
mm_cid after it has been replaced by a migration. Removing this prior
optimization is not an issue performance-wise because the introduced
per-mm/per-cpu mm_cid tracking also covers this more specific case.
Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
Reported-by: Aaron Lu <aaron.lu@xxxxxxxxx>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Tested-by: Aaron Lu <aaron.lu@xxxxxxxxx>
Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
I run standard benchmarks as part of kernel performance regression
testing. When I ran these benchmarks comparing v6.3.0 and v6.4-rc1,
I saw a performance regression in hackbench running with threads. A git
bisect pointed to this commit, and reverting it regains the performance.
This regression is not seen with hackbench processes.
Well, *this* commit was supposed to help fix the horrible contention on
cid_lock that was introduced with af7f588d8f73.
Following are the results from a 1 Socket 4th generation EPYC
Processor (1 x 96C/192T) configured in NPS1 mode. The regression
becomes more severe as the core count increases.
The numbers on a 1 Socket Bergamo (1 x 128 cores/256 threads) are significantly worse.
Threads (hackbench run time in seconds, lower is better):
Test:          With-mmcid-patch     Without-mmcid-patch
 1-groups:       5.23 (0.00 pct)      4.61 (+11.85 pct)
 2-groups:       4.99 (0.00 pct)      4.72 (+5.41 pct)
 4-groups:       5.96 (0.00 pct)      4.87 (+18.28 pct)
 8-groups:       6.58 (0.00 pct)      5.44 (+17.32 pct)
16-groups:      11.48 (0.00 pct)      8.07 (+29.70 pct)
I'm really confused: so you're saying that having a process-wide
spinlock is better than what this patch does? Or are you testing against
something without mm-cid entirely?