On Wed, Aug 23, 2023 at 02:52:17PM -0400, Mathieu Desnoyers wrote:
On 8/23/23 11:26, Mathieu Desnoyers wrote:
On 8/22/23 07:31, Mathieu Desnoyers wrote:
Introduce cpus_share_l2c to allow querying whether two logical CPUs
share a common L2 cache.
Considering a system like the AMD EPYC 9654 96-Core Processor, the L1
cache has a latency of 4-5 cycles, the L2 cache has a latency of at
least 14ns, whereas the L3 cache has a latency of 50ns [1]. Compared to
this, I measured RAM access latency at around 120ns on my system [2], so
L3 is really only about 2.4x faster than RAM. Therefore, given how slow
L3 accesses are compared to L2, the scheduler will benefit from using
remote runqueue locking rather than queued wakeups only between CPUs
that share an L2 cache.
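For reference, a minimal sketch of what such a helper could look like,
modelled on the existing cpus_share_cache(); the sd_l2c_id per-CPU id is
an assumption for illustration, not necessarily how the actual patch
wires it up:

/*
 * Sketch only: sd_l2c_id is assumed to be a per-CPU id shared by all
 * CPUs behind the same L2 cache, analogous to sd_llc_id for the LLC.
 */
bool cpus_share_l2c(int this_cpu, int that_cpu)
{
        if (this_cpu == that_cpu)
                return true;

        return per_cpu(sd_l2c_id, this_cpu) == per_cpu(sd_l2c_id, that_cpu);
}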
So I did some more benchmarking to figure out whether the reason for
this speedup is the latency delta between L2 and L3, or whether it is due
to the number of hw threads contending on the rq locks.
I tried to force grouping of those "skip ttwu queue" groups by a subset
of the LLC id, basically by taking the LLC id and adding the cpu number
modulo N, where N is chosen based on my machine topology.
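For illustration, one way such a group id could be derived (the helper
name and the exact combination of llc_id and cpu are assumptions, not
necessarily what the experiment used):

/*
 * Sketch only: split each LLC (16 hw threads on this machine) into n
 * sub-groups; wakeups within a sub-group use rq locks directly and
 * skip the queued ttwu path.
 */
static inline int ttwu_queue_group_id(int cpu, int llc_id, int n)
{
        /* n = 2 -> 8 hw threads per group, n = 4 -> 4, n = 8 -> 2 */
        return llc_id * n + (cpu % n);
}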
The end result is that I get similar numbers for groups of 1, 2, and 4 HW
threads (which use rq locks and skip queued ttwu within the group).
Starting with groups of size 8, performance starts to degrade.
So I wonder: do machines with more than 4 HW threads per L2 cache exist?
If that is the case, then we should think about grouping not only by L2
cache, but also sub-dividing those groups so that the number of hw
threads per group is at most 4.
Here are my results with the hackbench test-case:
- group cpus by 16 hw threads: (llc_id)
Time: 49s
- group cpus by 8 hw threads: (llc_id + cpu modulo 2)
Time: 39s
- group cpus by 4 hw threads: (llc_id + cpu modulo 4)
Time: 34s
- group cpus by 2 hw threads: (llc_id + cpu modulo 8)
(expected to be the same as L2 grouping on this machine)
Time: 34s
- group cpus by 1 hw thread: (cpu)
Time: 33s
One more interesting data point: I tried modifying the grouping so that
I would explicitly group together hw threads which sit in different L3
caches, and even on different NUMA nodes for some of them
(group id = cpu_id % 192). This is expected to generate really _bad_
cache locality for the runqueue locks within a group.
The result for these groups of 3 HW threads is about 33s with the
hackbench benchmark, which seems to confirm that the cause of the speedup
is reduced contention on the rq locks from making the groups smaller,
rather than improved cache locality from moving from L3 to L2.
In addition to reduced rq lock contention, I think another reason this
improves performance is that it reduces task migration. Not sure if that
is the case on your test system, but on my test machine (Intel SPR), the
task migration count dropped.
Hackbench on Intel SPR (2 sockets / 112 cores / 224 threads) test summary:
- performance improved for all three cases; the more tasks (groups), the
  bigger the performance gain;
- task migrations dropped with this series for nr_group=20 and 32
  according to 'perf stat'. The migration number didn't drop for
  nr_group=10, but the two update functions' cost dropped, which implies
  fewer accesses to tg->load_avg and thus fewer task migrations. This is
  contradictory and I cannot explain it yet;
- rq lock contention dropped for all three cases, and it dropped the most
  in the most overloaded case: nr_group=32.
It's not clear to me why this series can reduce task migrations. I
suspect it has something to do with more wakelist-style wakeups, because
on this test machine only a single core with two SMT threads shares an
L2, so more wakeups go through the wakelist. In a wakelist-style wakeup,
the target rq's ttwu_pending is set, and that makes the target cpu report
!idle_cpu(). This is faster than grabbing the target rq's lock and then
increasing the target rq's nr_running or setting the target rq's curr to
something other than idle. So wakelist-style wakeups can make the target
cpu appear non-idle sooner, but I can't connect this with reduced
migration yet; I just feel this might be the reason why task migration
dropped.
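For reference, this is roughly why a pending wakelist entry makes the
target look busy; a simplified paraphrase of idle_cpu() from
kernel/sched/core.c (exact details vary by kernel version):

int idle_cpu(int cpu)
{
        struct rq *rq = cpu_rq(cpu);

        if (rq->curr != rq->idle)       /* something other than the idle task is running */
                return 0;

        if (rq->nr_running)             /* runnable tasks already enqueued */
                return 0;

        if (rq->ttwu_pending)           /* wakelist wakeup queued, set before the IPI lands */
                return 0;

        return 1;
}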
Below are detailed test data.
Base: 6.5-rc1.
rq_spin%: the percentage of cycles in raw_spin_rq_lock_nested(), as
reported by perf with events=cycles:pp
migration: cpu-migrations reported by "perf stat -a -- sleep 5"
The cmdline used is:
hackbench -g $nr_group -f 20 --pipe --threads -l 480000 -s 100
nr_group=10:
             time      rq_spin%  update_cfs_group%  update_load_avg%  migration
base         46s       1.32%     20.06%             10.78%            10.227 K/sec
this_series  37s       0.57%     15.08%             7.05%             10.722 K/sec

nr_group=20:
             time      rq_spin%  update_cfs_group%  update_load_avg%  migration
base         69s       2.57%     19.68%             10.74%            12.098 K/sec
this_series  41s       0.62%     12.11%             5.78%             8.617 K/sec

nr_group=32:
             time      rq_spin%  update_cfs_group%  update_load_avg%  migration
base         192s±25%  15.12%    25.83%             9.33%             12.141 K/sec
this_series  71s       0.47%     10.98%             4.58%             8.198 K/sec
I also tested applying my "ratelimit update to tg->load_avg" patch and
the test summary is:
- performance improved noticeably for nr_group=20 and slightly for the
  nr_group=10 case; nr_group=32's performance is roughly the same.
- task migrations dropped for all three cases; nr_group=20 saw the
biggest drop.
- rq lock contention dropped for all three cases and again, nr_group=32
saw the biggest drop.
Below are detailed data.
Base: Peter's sched/core branch with my "ratelimit" patch.
this_series: this patchset applied on top of base.
nr_group=10:
             time      rq_spin%  update_cfs_group%  update_load_avg%  migration
base         36s       0.55%     0.46%              1.43%             15.034 K/sec
this_series  35s       0.56%     0.52%              1.53%             13.751 K/sec

nr_group=20:
             time      rq_spin%  update_cfs_group%  update_load_avg%  migration
base         47s       1.28%     0.73%              2.33%             21.217 K/sec
this_series  42s       0.60%     0.69%              1.69%             14.130 K/sec

nr_group=32:
             time      rq_spin%  update_cfs_group%  update_load_avg%  migration
base         70s       2.38%     0.60%              2.19%             17.855 K/sec
this_series  70s       0.58%     0.63%              1.77%             12.331 K/sec
Thanks,
Aaron