[PATCH RESEND 0/9] sched: cpumask: improve on cpumask_local_spread() locality

From: Yury Norov
Date: Fri Jan 20 2023 - 23:25:01 EST


cpumask_local_spread() currently checks local node for presence of i'th
CPU, and then if it finds nothing makes a flat search among all non-local
CPUs. We can do it better by checking CPUs per NUMA hops.

This has significant performance implications on NUMA machines, for example
when using NUMA-aware allocated memory together with NUMA-aware IRQ
affinity hints.

Performance tests from patch 8 of this series for mellanox network
driver show:

TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121

+-------------------------+-----------+------------------+------------------+
| | BW (Gbps) | TX side CPU util | RX side CPU util |
+-------------------------+-----------+------------------+------------------+
| Baseline | 52.3 | 6.4 % | 17.9 % |
+-------------------------+-----------+------------------+------------------+
| Applied on TX side only | 52.6 | 5.2 % | 18.5 % |
+-------------------------+-----------+------------------+------------------+
| Applied on RX side only | 94.9 | 11.9 % | 27.2 % |
+-------------------------+-----------+------------------+------------------+
| Applied on both sides | 95.1 | 8.4 % | 27.3 % |
+-------------------------+-----------+------------------+------------------+

Bottleneck in RX side is released, reached linerate (~1.8x speedup).
~30% less cpu util on TX.

This series was supposed to be included in v6.2, but that didn't happen. It
spent enough in -next without any issues, so I hope we'll finally see it
in v6.3.

I believe, the best way would be moving it with scheduler patches, but I'm
OK to try again with bitmap branch as well.

Tariq Toukan (1):
net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
hints

Valentin Schneider (2):
sched/topology: Introduce sched_numa_hop_mask()
sched/topology: Introduce for_each_numa_hop_mask()

Yury Norov (6):
lib/find: introduce find_nth_and_andnot_bit
cpumask: introduce cpumask_nth_and_andnot
sched: add sched_numa_find_nth_cpu()
cpumask: improve on cpumask_local_spread() locality
lib/cpumask: reorganize cpumask_local_spread() logic
lib/cpumask: update comment for cpumask_local_spread()

drivers/net/ethernet/mellanox/mlx5/core/eq.c | 18 +++-
include/linux/cpumask.h | 20 +++++
include/linux/find.h | 33 +++++++
include/linux/topology.h | 33 +++++++
kernel/sched/topology.c | 90 ++++++++++++++++++++
lib/cpumask.c | 52 ++++++-----
lib/find_bit.c | 9 ++
7 files changed, 230 insertions(+), 25 deletions(-)

--
2.34.1