Re: [RFC][PATCH] sched: Cache aware load-balancing

From: K Prateek Nayak
Date: Wed Mar 26 2025 - 02:18:32 EST


Hello Peter, Chenyu,

On 3/26/2025 12:14 AM, Peter Zijlstra wrote:
> On Tue, Mar 25, 2025 at 11:19:52PM +0800, Chen, Yu C wrote:
>
>> Hi Peter,
>>
>> Thanks for sending this out,
>>
>> On 3/25/2025 8:09 PM, Peter Zijlstra wrote:
>>> Hi all,
>>>
>>> One of the many things on the eternal todo list has been finishing
>>> the below hackery.
>>>
>>> It is an attempt at modelling cache affinity -- and while the patch
>>> really only targets LLC, it could very well be extended to also
>>> apply to clusters (L2) -- specifically, any case of multiple cache
>>> domains inside a node.
>>>
>>> Anyway, I wrote this about a year ago, and I mentioned it at the
>>> recent OSPM conf, where Gautham and Prateek expressed interest in
>>> playing with this code.
>>>
>>> So here goes, very rough and largely unproven code ahead :-)
>>>
>>> It applies to current tip/master, but I know it will fail the
>>> __percpu validation that sits in -next, although that shouldn't be
>>> terribly hard to fix up.
>>>
>>> As is, it only computes the CPU inside the LLC that has the highest
>>> recent runtime; this CPU is then used in the wake-up path to steer
>>> wakeups towards that LLC, and in task_hot() to limit migrations
>>> away from it.
>>>
>>> More elaborate things could be done, notably there is an XXX in
>>> there somewhere about finding the best LLC inside a NODE
>>> (interaction with NUMA_BALANCING).
>>
>> Besides the control provided by CONFIG_SCHED_CACHE, could we also
>> introduce a sched_feat(SCHED_CACHE) to manage this feature,
>> facilitating dynamic adjustment? Similarly, we could introduce other
>> sched_feats for load balancing and NUMA balancing for fine-grained
>> control.
>
> We can do all sorts, but the very first thing is determining if this
> is worth it at all. Because if we can't make this work at all, all
> those things are a waste of time.
>
> This patch is not meant to be merged; it is meant for testing and
> development. We need to first make it actually improve workloads. If
> it then turns out to regress workloads (likely, things always do), we
> can look at how best to address that.
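
For the testing phase, a sched_feat() toggle along the lines Chenyu
suggested is cheap and makes A/B runs on the same kernel image easier.
A minimal sketch, assuming the flag is simply named SCHED_CACHE and
with select_cache_cpu() standing in for whatever the wakeup hook ends
up being called:

	/* kernel/sched/features.h */
	SCHED_FEAT(SCHED_CACHE, true)

	/* ... and guard the new paths in kernel/sched/fair.c with it: */
	if (sched_feat(SCHED_CACHE))
		new_cpu = select_cache_cpu(p, prev_cpu);

The feature can then be flipped at runtime via
/sys/kernel/debug/sched/features (write "NO_SCHED_CACHE" to disable).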


Thank you for sharing the patch, and thanks to Chenyu for the initial
review pointing out issues that need fixing. I'll try to take a good
look at it this week and see if I can improve some of the trivial
benchmarks that currently regress with the RFC as is.

In its current form I think this suffers from the same problem as
SIS_NODE, where wakeups are redirected to the same set of CPUs and a
good deal of additional work is done without any benefit.
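
For anyone following along, the rough shape of that wakeup-side
steering as I read it is below. This is a sketch of the idea rather
than the patch verbatim, and names like mm_sched_cpu and
cache_preferred_cpu() are illustrative: the patch accounts per-mm
runtime per LLC and nominates the CPU with the highest recent runtime.

	static int cache_preferred_cpu(struct task_struct *p, int prev_cpu)
	{
		int cpu;

		if (!p->mm)	/* kernel threads carry no mm to track */
			return prev_cpu;

		/* CPU with the highest recent runtime for this mm, -1 if none */
		cpu = READ_ONCE(p->mm->mm_sched_cpu);
		if (cpu < 0 || cpus_share_cache(cpu, prev_cpu))
			return prev_cpu;	/* no hint, or already in the hot LLC */

		return cpu;		/* steer the wakeup to the hot LLC */
	}

Since every thread sharing an mm is funneled towards a single LLC,
once the runnable count outgrows that LLC the tasks queue behind each
other, which would be consistent with the hackbench and schbench
numbers below.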

I'll leave the results from my initial testing on the 3rd Generation
EPYC platform below and will evaluate what is making the benchmarks
unhappy. I'll return with more data when some of these benchmarks
are not as unhappy as they are now.

Thank you both for the RFC and the initial feedback. Following are the
initial results for the RFC as is. All numbers are normalized to the
tip baseline; "pct imp" is the percentage improvement over tip, and
"(CV)" is the coefficient of variation:

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) sched_cache[pct imp](CV)
1-groups 1.00 [ -0.00](10.12) 1.01 [ -0.89]( 2.84)
2-groups 1.00 [ -0.00]( 6.92) 1.83 [-83.15]( 1.61)
4-groups 1.00 [ -0.00]( 3.14) 3.00 [-200.21]( 3.13)
8-groups 1.00 [ -0.00]( 1.35) 3.44 [-243.75]( 2.20)
16-groups 1.00 [ -0.00]( 1.32) 2.59 [-158.98]( 4.29)


==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) sched_cache[pct imp](CV)
1 1.00 [ 0.00]( 0.43) 0.96 [ -3.54]( 0.56)
2 1.00 [ 0.00]( 0.58) 0.99 [ -1.32]( 1.40)
4 1.00 [ 0.00]( 0.54) 0.98 [ -2.34]( 0.78)
8 1.00 [ 0.00]( 0.49) 0.96 [ -3.91]( 0.54)
16 1.00 [ 0.00]( 1.06) 0.97 [ -3.22]( 1.82)
32 1.00 [ 0.00]( 1.27) 0.95 [ -4.74]( 2.05)
64 1.00 [ 0.00]( 1.54) 0.93 [ -6.65]( 0.63)
128 1.00 [ 0.00]( 0.38) 0.93 [ -6.91]( 1.18)
256 1.00 [ 0.00]( 1.85) 0.99 [ -0.50]( 1.34)
512 1.00 [ 0.00]( 0.31) 0.98 [ -2.47]( 0.14)
1024 1.00 [ 0.00]( 0.19) 0.97 [ -3.06]( 0.39)


==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) sched_cache[pct imp](CV)
Copy 1.00 [ 0.00](11.31) 0.34 [-65.89](72.77)
Scale 1.00 [ 0.00]( 6.62) 0.32 [-68.09](72.49)
Add 1.00 [ 0.00]( 7.06) 0.34 [-65.56](70.56)
Triad 1.00 [ 0.00]( 8.91) 0.34 [-66.47](72.70)


==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) sched_cache[pct imp](CV)
Copy 1.00 [ 0.00]( 2.01) 0.83 [-16.96](24.55)
Scale 1.00 [ 0.00]( 1.49) 0.79 [-21.40](24.10)
Add 1.00 [ 0.00]( 2.67) 0.79 [-21.33](25.39)
Triad 1.00 [ 0.00]( 2.19) 0.81 [-19.19](25.55)


==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) sched_cache[pct imp](CV)
1-clients 1.00 [ 0.00]( 1.43) 0.98 [ -2.22]( 0.26)
2-clients 1.00 [ 0.00]( 1.02) 0.97 [ -2.55]( 0.89)
4-clients 1.00 [ 0.00]( 0.83) 0.98 [ -2.27]( 0.46)
8-clients 1.00 [ 0.00]( 0.73) 0.98 [ -2.45]( 0.80)
16-clients 1.00 [ 0.00]( 0.97) 0.97 [ -2.90]( 0.88)
32-clients 1.00 [ 0.00]( 0.88) 0.95 [ -5.29]( 1.69)
64-clients 1.00 [ 0.00]( 1.49) 0.91 [ -8.70]( 1.95)
128-clients 1.00 [ 0.00]( 1.05) 0.92 [ -8.39]( 4.25)
256-clients 1.00 [ 0.00]( 3.85) 0.92 [ -8.33]( 2.45)
512-clients 1.00 [ 0.00](59.63) 0.92 [ -7.83](51.19)


==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sched_cache[pct imp](CV)
1 1.00 [ -0.00]( 6.67) 0.38 [ 62.22] ( 5.88)
2 1.00 [ -0.00](10.18) 0.43 [ 56.52] ( 2.94)
4 1.00 [ -0.00]( 4.49) 0.60 [ 40.43] ( 5.52)
8 1.00 [ -0.00]( 6.68) 113.96 [-11296.23] (12.91)
16 1.00 [ -0.00]( 1.87) 359.34 [-35834.43] (20.02)
32 1.00 [ -0.00]( 4.01) 217.67 [-21667.03] ( 5.48)
64 1.00 [ -0.00]( 3.21) 97.43 [-9643.02] ( 4.61)
128 1.00 [ -0.00](44.13) 41.36 [-4036.10] ( 6.92)
256 1.00 [ -0.00](14.46) 2.69 [-169.31] ( 1.86)
512 1.00 [ -0.00]( 1.95) 1.89 [-89.22] ( 2.24)


==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sched_cache[pct imp](CV)
1 1.00 [ 0.00]( 0.46) 0.96 [ -4.14]( 0.00)
2 1.00 [ 0.00]( 0.15) 0.95 [ -5.27]( 2.29)
4 1.00 [ 0.00]( 0.15) 0.88 [-12.01]( 0.46)
8 1.00 [ 0.00]( 0.15) 0.55 [-45.47]( 1.23)
16 1.00 [ 0.00]( 0.00) 0.54 [-45.62]( 0.50)
32 1.00 [ 0.00]( 3.40) 0.63 [-37.48]( 6.37)
64 1.00 [ 0.00]( 7.09) 0.67 [-32.73]( 0.59)
128 1.00 [ 0.00]( 0.00) 0.99 [ -0.76]( 0.34)
256 1.00 [ 0.00]( 1.12) 1.06 [ 6.32]( 1.55)
512 1.00 [ 0.00]( 0.22) 1.06 [ 6.08]( 0.92)


==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sched_cache[pct imp](CV)
1 1.00 [ -0.00](19.72) 0.85 [ 15.38] ( 8.13)
2 1.00 [ -0.00](15.96) 1.09 [ -9.09] (18.20)
4 1.00 [ -0.00]( 3.87) 1.00 [ -0.00] ( 0.00)
8 1.00 [ -0.00]( 8.15) 118.17 [-11716.67] ( 0.58)
16 1.00 [ -0.00]( 3.87) 146.62 [-14561.54] ( 4.64)
32 1.00 [ -0.00](12.99) 141.60 [-14060.00] ( 5.64)
64 1.00 [ -0.00]( 6.20) 78.62 [-7762.50] ( 1.79)
128 1.00 [ -0.00]( 0.96) 11.36 [-1036.08] ( 3.41)
256 1.00 [ -0.00]( 2.76) 1.11 [-11.22] ( 3.28)
512 1.00 [ -0.00]( 0.20) 1.21 [-20.81] ( 0.91)


==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sched_cache[pct imp](CV)
1 1.00 [ -0.00]( 1.07) 1.11 [-10.66] ( 2.76)
2 1.00 [ -0.00]( 0.14) 1.20 [-20.40] ( 1.73)
4 1.00 [ -0.00]( 1.39) 2.04 [-104.20] ( 0.96)
8 1.00 [ -0.00]( 0.36) 3.94 [-294.20] ( 2.85)
16 1.00 [ -0.00]( 1.18) 4.56 [-356.16] ( 1.19)
32 1.00 [ -0.00]( 8.42) 3.02 [-201.67] ( 8.93)
64 1.00 [ -0.00]( 4.85) 1.51 [-51.38] ( 0.80)
128 1.00 [ -0.00]( 0.28) 1.83 [-82.77] ( 1.21)
256 1.00 [ -0.00](10.52) 1.43 [-43.11] (10.67)
512 1.00 [ -0.00]( 0.69) 1.25 [-24.96] ( 6.24)


==================================================================
Test : Various longer running benchmarks
Units : %diff in throughput reported
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmarks: %diff
ycsb-cassandra -10.70%
ycsb-mongodb -13.66%

deathstarbench-1x 13.87%
deathstarbench-2x 1.70%
deathstarbench-3x -8.44%
deathstarbench-6x -3.12%

hammerdb+mysql 16VU -33.50%
hammerdb+mysql 64VU -33.22%

---

I'm planning on taking hackbench and schbench as the two extreme cases
for throughput and tail latency, and will later look at Stream from a
"high bandwidth, don't consolidate" standpoint. I hope that once those
cases aren't as deep in the red, the larger benchmarks will be happier
too.

--
Thanks and Regards,
Prateek