On 8/22/23 07:31, Mathieu Desnoyers wrote:
> Introduce cpus_share_l2c to allow querying whether two logical CPUs
> share a common L2 cache.
> Considering a system like the AMD EPYC 9654 96-Core Processor, the L1
> cache has a latency of 4-5 cycles, the L2 cache has a latency of at
> least 14ns, whereas the L3 cache has a latency of 50ns [1]. Compared
> to this, I measured RAM access latency at around 120ns on my
> system [2]. So L3 is really only 2.4x faster than RAM accesses.
> Therefore, given this relatively slow access speed compared to L2,
> the scheduler will benefit from only considering CPUs sharing an L2
> cache for the purpose of using remote runqueue locking rather than
> queued wakeups.
So I did some more benchmarking to figure out whether the reason for this speedup is the latency delta between L2 and L3, or the number of HW threads contending on the rq locks.
I tried to force grouping of those "skip ttwu queue" groups by a subset of the LLC id, basically by taking the LLC id and adding the cpu number modulo N, where N is chosen based on my machine's topology.
The end result is that I get similar numbers for groups of 1, 2 and 4 HW threads (which use rq locks and skip queued ttwu within the group). Starting with groups of size 8, performance starts to degrade.
So I wonder: do machines with more than 4 HW threads per L2 cache exist? If so, we should think about grouping not only by L2 cache, but also sub-dividing each group so that the number of HW threads per group is at most 4.
Here are my results with the hackbench test-case:
- group cpus by 16 hw threads: (llc_id)
  Time: 49s
- group cpus by 8 hw threads: (llc_id + cpu modulo 2)
  Time: 39s
- group cpus by 4 hw threads: (llc_id + cpu modulo 4)
  Time: 34s
- group cpus by 2 hw threads: (llc_id + cpu modulo 8)
  (expect same as L2 grouping on this machine)
  Time: 34s
- group cpus by 1 hw thread: (cpu)
  Time: 33s