Re: [RFC][PATCH] sched: Cache aware load-balancing

From: Chen, Yu C
Date: Wed Mar 26 2025 - 05:15:57 EST



Hi Prateek,

On 3/26/2025 2:18 PM, K Prateek Nayak wrote:
Hello Peter, Chenyu,

On 3/26/2025 12:14 AM, Peter Zijlstra wrote:
On Tue, Mar 25, 2025 at 11:19:52PM +0800, Chen, Yu C wrote:

Hi Peter,

Thanks for sending this out,

On 3/25/2025 8:09 PM, Peter Zijlstra wrote:
Hi all,

One of the many things on the eternal todo list has been finishing the
below hackery.

It is an attempt at modelling cache affinity -- and while the patch
really only targets LLC, it could very well be extended to also apply to
clusters (L2). Specifically any case of multiple cache domains inside a
node.

Anyway, I wrote this about a year ago, and I mentioned this at the
recent OSPM conf where Gautham and Prateek expressed interest in playing
with this code.

So here goes, very rough and largely unproven code ahead :-)

It applies to current tip/master, but I know it will fail the __percpu
validation that sits in -next, although that shouldn't be terribly hard
to fix up.

As is, it only computes a CPU inside the LLC that has the highest recent
runtime; this CPU is then used in the wake-up path to steer towards this
LLC, and in task_hot() to limit migrations away from it.

More elaborate things could be done, notably there is an XXX in there
somewhere about finding the best LLC inside a NODE (interaction with
NUMA_BALANCING).


Besides the control provided by CONFIG_SCHED_CACHE, could we also introduce
sched_feat(SCHED_CACHE) to manage this feature, facilitating dynamic
adjustments? Similarly, we could introduce other sched_feats for load
balancing and NUMA balancing for fine-grained control.

We can do all sorts, but the very first thing is determining if this is
worth it at all. Because if we can't make this work at all, all those
things are a waste of time.

This patch is not meant to be merged, it is meant for testing and
development. We need to first make it actually improve workloads. If it
then turns out it regresses workloads (likely, things always do), then
we can look at how to best do that.


Thank you for sharing the patch, and thanks to Chenyu for the initial review
pointing to issues that need fixing. I'll try to take a good look at it
this week and see if I can improve some of the trivial benchmarks that
currently regress with the RFC as is.

In its current form I think this suffers from the same problem as
SIS_NODE where wakeups redirect to same set of CPUs and a good deal of
additional work is being done without any benefit.

I'll leave the results from my initial testing on the 3rd Generation
EPYC platform below and will evaluate what is making the benchmarks
unhappy. I'll return with more data when some of these benchmarks
are not as unhappy as they are now.

Thank you both for the RFC and the initial feedback. Following are
the initial results for the RFC as is:

  ==================================================================
  Test          : hackbench
  Units         : Normalized time in seconds
  Interpretation: Lower is better
  Statistic     : AMean
  ==================================================================
  Case:           tip[pct imp](CV)      sched_cache[pct imp](CV)
   1-groups     1.00 [ -0.00](10.12)     1.01 [ -0.89]( 2.84)
   2-groups     1.00 [ -0.00]( 6.92)     1.83 [-83.15]( 1.61)
   4-groups     1.00 [ -0.00]( 3.14)     3.00 [-200.21]( 3.13)
   8-groups     1.00 [ -0.00]( 1.35)     3.44 [-243.75]( 2.20)
  16-groups     1.00 [ -0.00]( 1.32)     2.59 [-158.98]( 4.29)


  ==================================================================
  Test          : tbench
  Units         : Normalized throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:    tip[pct imp](CV)     sched_cache[pct imp](CV)
      1     1.00 [  0.00]( 0.43)     0.96 [ -3.54]( 0.56)
      2     1.00 [  0.00]( 0.58)     0.99 [ -1.32]( 1.40)
      4     1.00 [  0.00]( 0.54)     0.98 [ -2.34]( 0.78)
      8     1.00 [  0.00]( 0.49)     0.96 [ -3.91]( 0.54)
     16     1.00 [  0.00]( 1.06)     0.97 [ -3.22]( 1.82)
     32     1.00 [  0.00]( 1.27)     0.95 [ -4.74]( 2.05)
     64     1.00 [  0.00]( 1.54)     0.93 [ -6.65]( 0.63)
    128     1.00 [  0.00]( 0.38)     0.93 [ -6.91]( 1.18)
    256     1.00 [  0.00]( 1.85)     0.99 [ -0.50]( 1.34)
    512     1.00 [  0.00]( 0.31)     0.98 [ -2.47]( 0.14)
   1024     1.00 [  0.00]( 0.19)     0.97 [ -3.06]( 0.39)


  ==================================================================
  Test          : stream-10
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)     sched_cache[pct imp](CV)
   Copy     1.00 [  0.00](11.31)     0.34 [-65.89](72.77)
  Scale     1.00 [  0.00]( 6.62)     0.32 [-68.09](72.49)
    Add     1.00 [  0.00]( 7.06)     0.34 [-65.56](70.56)
  Triad     1.00 [  0.00]( 8.91)     0.34 [-66.47](72.70)


  ==================================================================
  Test          : stream-100
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)     sched_cache[pct imp](CV)
   Copy     1.00 [  0.00]( 2.01)     0.83 [-16.96](24.55)
  Scale     1.00 [  0.00]( 1.49)     0.79 [-21.40](24.10)
    Add     1.00 [  0.00]( 2.67)     0.79 [-21.33](25.39)
  Triad     1.00 [  0.00]( 2.19)     0.81 [-19.19](25.55)


  ==================================================================
  Test          : netperf
  Units         : Normalized Throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:         tip[pct imp](CV)     sched_cache[pct imp](CV)
   1-clients     1.00 [  0.00]( 1.43)     0.98 [ -2.22]( 0.26)
   2-clients     1.00 [  0.00]( 1.02)     0.97 [ -2.55]( 0.89)
   4-clients     1.00 [  0.00]( 0.83)     0.98 [ -2.27]( 0.46)
   8-clients     1.00 [  0.00]( 0.73)     0.98 [ -2.45]( 0.80)
  16-clients     1.00 [  0.00]( 0.97)     0.97 [ -2.90]( 0.88)
  32-clients     1.00 [  0.00]( 0.88)     0.95 [ -5.29]( 1.69)
  64-clients     1.00 [  0.00]( 1.49)     0.91 [ -8.70]( 1.95)
  128-clients    1.00 [  0.00]( 1.05)     0.92 [ -8.39]( 4.25)
  256-clients    1.00 [  0.00]( 3.85)     0.92 [ -8.33]( 2.45)
  512-clients    1.00 [  0.00](59.63)     0.92 [ -7.83](51.19)


  ==================================================================
  Test          : schbench
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)       sched_cache[pct imp](CV)
    1     1.00 [ -0.00]( 6.67)      0.38 [ 62.22]    ( 5.88)
    2     1.00 [ -0.00](10.18)      0.43 [ 56.52]    ( 2.94)
    4     1.00 [ -0.00]( 4.49)      0.60 [ 40.43]    ( 5.52)
    8     1.00 [ -0.00]( 6.68)    113.96 [-11296.23] (12.91)
   16     1.00 [ -0.00]( 1.87)    359.34 [-35834.43] (20.02)
   32     1.00 [ -0.00]( 4.01)    217.67 [-21667.03] ( 5.48)
   64     1.00 [ -0.00]( 3.21)     97.43 [-9643.02]  ( 4.61)
  128     1.00 [ -0.00](44.13)     41.36 [-4036.10]  ( 6.92)
  256     1.00 [ -0.00](14.46)      2.69 [-169.31]   ( 1.86)
  512     1.00 [ -0.00]( 1.95)      1.89 [-89.22]    ( 2.24)


  ==================================================================
  Test          : new-schbench-requests-per-second
  Units         : Normalized Requests per second
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)      sched_cache[pct imp](CV)
    1     1.00 [  0.00]( 0.46)     0.96 [ -4.14]( 0.00)
    2     1.00 [  0.00]( 0.15)     0.95 [ -5.27]( 2.29)
    4     1.00 [  0.00]( 0.15)     0.88 [-12.01]( 0.46)
    8     1.00 [  0.00]( 0.15)     0.55 [-45.47]( 1.23)
   16     1.00 [  0.00]( 0.00)     0.54 [-45.62]( 0.50)
   32     1.00 [  0.00]( 3.40)     0.63 [-37.48]( 6.37)
   64     1.00 [  0.00]( 7.09)     0.67 [-32.73]( 0.59)
  128     1.00 [  0.00]( 0.00)     0.99 [ -0.76]( 0.34)
  256     1.00 [  0.00]( 1.12)     1.06 [  6.32]( 1.55)
  512     1.00 [  0.00]( 0.22)     1.06 [  6.08]( 0.92)


  ==================================================================
  Test          : new-schbench-wakeup-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)       sched_cache[pct imp](CV)
    1     1.00 [ -0.00](19.72)     0.85  [ 15.38]    ( 8.13)
    2     1.00 [ -0.00](15.96)     1.09  [ -9.09]    (18.20)
    4     1.00 [ -0.00]( 3.87)     1.00  [ -0.00]    ( 0.00)
    8     1.00 [ -0.00]( 8.15)    118.17 [-11716.67] ( 0.58)
   16     1.00 [ -0.00]( 3.87)    146.62 [-14561.54] ( 4.64)
   32     1.00 [ -0.00](12.99)    141.60 [-14060.00] ( 5.64)
   64     1.00 [ -0.00]( 6.20)    78.62  [-7762.50]  ( 1.79)
  128     1.00 [ -0.00]( 0.96)    11.36  [-1036.08]  ( 3.41)
  256     1.00 [ -0.00]( 2.76)     1.11  [-11.22]    ( 3.28)
  512     1.00 [ -0.00]( 0.20)     1.21  [-20.81]    ( 0.91)


  ==================================================================
  Test          : new-schbench-request-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)      sched_cache[pct imp](CV)
    1     1.00 [ -0.00]( 1.07)     1.11 [-10.66]  ( 2.76)
    2     1.00 [ -0.00]( 0.14)     1.20 [-20.40]  ( 1.73)
    4     1.00 [ -0.00]( 1.39)     2.04 [-104.20] ( 0.96)
    8     1.00 [ -0.00]( 0.36)     3.94 [-294.20] ( 2.85)
   16     1.00 [ -0.00]( 1.18)     4.56 [-356.16] ( 1.19)
   32     1.00 [ -0.00]( 8.42)     3.02 [-201.67] ( 8.93)
   64     1.00 [ -0.00]( 4.85)     1.51 [-51.38]  ( 0.80)
  128     1.00 [ -0.00]( 0.28)     1.83 [-82.77]  ( 1.21)
  256     1.00 [ -0.00](10.52)     1.43 [-43.11]  (10.67)
  512     1.00 [ -0.00]( 0.69)     1.25 [-24.96]  ( 6.24)


  ==================================================================
  Test          : Various longer running benchmarks
  Units         : %diff in throughput reported
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  Benchmarks:                 %diff
  ycsb-cassandra             -10.70%
  ycsb-mongodb               -13.66%

  deathstarbench-1x           13.87%
  deathstarbench-2x            1.70%
  deathstarbench-3x           -8.44%
  deathstarbench-6x           -3.12%

  hammerdb+mysql 16VU        -33.50%
  hammerdb+mysql 64VU        -33.22%

---

I'm planning on taking hackbench and schbench as two extreme cases for
throughput and tail latency and later look at Stream from a "high
bandwidth, don't consolidate" standpoint. I hope once those cases
aren't as much in the reds, the larger benchmarks will be happier too.


Thanks for running the tests. I think hackbench and schbench would be good benchmarks to start with. I remember that you and Gautham mentioned at LPC 2021 or 2022 that schbench prefers to be aggregated in a single LLC. I ran a schbench test using mmtests on a Xeon server which has 4 NUMA nodes, each node having 80 cores (with SMT disabled). The numa=off option was appended to the boot command line, so the 4 nodes show up as 4 "LLCs" within a single node.


                              BASELINE              SCHED_CACHE
Lat 50.0th-qrtle-1 8.00 ( 0.00%) 5.00 ( 37.50%)
Lat 90.0th-qrtle-1 9.00 ( 0.00%) 5.00 ( 44.44%)
Lat 99.0th-qrtle-1 13.00 ( 0.00%) 10.00 ( 23.08%)
Lat 99.9th-qrtle-1 21.00 ( 0.00%) 19.00 ( 9.52%)*
Lat 20.0th-qrtle-1 404.00 ( 0.00%) 411.00 ( -1.73%)
Lat 50.0th-qrtle-2 8.00 ( 0.00%) 5.00 ( 37.50%)
Lat 90.0th-qrtle-2 11.00 ( 0.00%) 8.00 ( 27.27%)
Lat 99.0th-qrtle-2 16.00 ( 0.00%) 11.00 ( 31.25%)
Lat 99.9th-qrtle-2 27.00 ( 0.00%) 17.00 ( 37.04%)*
Lat 20.0th-qrtle-2 823.00 ( 0.00%) 821.00 ( 0.24%)
Lat 50.0th-qrtle-4 10.00 ( 0.00%) 5.00 ( 50.00%)
Lat 90.0th-qrtle-4 12.00 ( 0.00%) 6.00 ( 50.00%)
Lat 99.0th-qrtle-4 18.00 ( 0.00%) 9.00 ( 50.00%)
Lat 99.9th-qrtle-4 29.00 ( 0.00%) 16.00 ( 44.83%)*
Lat 20.0th-qrtle-4 1650.00 ( 0.00%) 1598.00 ( 3.15%)
Lat 50.0th-qrtle-8 9.00 ( 0.00%) 4.00 ( 55.56%)
Lat 90.0th-qrtle-8 11.00 ( 0.00%) 6.00 ( 45.45%)
Lat 99.0th-qrtle-8 16.00 ( 0.00%) 9.00 ( 43.75%)
Lat 99.9th-qrtle-8 28.00 ( 0.00%) 188.00 (-571.43%)*
Lat 20.0th-qrtle-8 3316.00 ( 0.00%) 3100.00 ( 6.51%)
Lat 50.0th-qrtle-16 10.00 ( 0.00%) 5.00 ( 50.00%)
Lat 90.0th-qrtle-16 13.00 ( 0.00%) 7.00 ( 46.15%)
Lat 99.0th-qrtle-16 19.00 ( 0.00%) 12.00 ( 36.84%)
Lat 99.9th-qrtle-16 28.00 ( 0.00%) 2034.00 (-7164.29%)*
Lat 20.0th-qrtle-16 6632.00 ( 0.00%) 5800.00 ( 12.55%)
Lat 50.0th-qrtle-32 7.00 ( 0.00%) 12.00 ( -71.43%)
Lat 90.0th-qrtle-32 10.00 ( 0.00%) 62.00 (-520.00%)
Lat 99.0th-qrtle-32 14.00 ( 0.00%) 841.00 (-5907.14%)
Lat 99.9th-qrtle-32 23.00 ( 0.00%) 1862.00 (-7995.65%)*
Lat 20.0th-qrtle-32 13264.00 ( 0.00%) 10608.00 ( 20.02%)
Lat 50.0th-qrtle-64 7.00 ( 0.00%) 64.00 (-814.29%)
Lat 90.0th-qrtle-64 12.00 ( 0.00%) 709.00 (-5808.33%)
Lat 99.0th-qrtle-64 18.00 ( 0.00%) 2260.00 (-12455.56%)
Lat 99.9th-qrtle-64 26.00 ( 0.00%) 3572.00 (-13638.46%)*
Lat 20.0th-qrtle-64 26528.00 ( 0.00%) 14064.00 ( 46.98%)
Lat 50.0th-qrtle-128 7.00 ( 0.00%) 115.00 (-1542.86%)
Lat 90.0th-qrtle-128 11.00 ( 0.00%) 1626.00 (-14681.82%)
Lat 99.0th-qrtle-128 17.00 ( 0.00%) 4472.00 (-26205.88%)
Lat 99.9th-qrtle-128 27.00 ( 0.00%) 8088.00 (-29855.56%)*
Lat 20.0th-qrtle-128 53184.00 ( 0.00%) 17312.00 ( 67.45%)
Lat 50.0th-qrtle-256 172.00 ( 0.00%) 255.00 ( -48.26%)
Lat 90.0th-qrtle-256 2092.00 ( 0.00%) 1482.00 ( 29.16%)
Lat 99.0th-qrtle-256 2684.00 ( 0.00%) 3148.00 ( -17.29%)
Lat 99.9th-qrtle-256 4504.00 ( 0.00%) 6008.00 ( -33.39%)*
Lat 20.0th-qrtle-256 53056.00 ( 0.00%) 48064.00 ( 9.41%)
Lat 50.0th-qrtle-319 375.00 ( 0.00%) 478.00 ( -27.47%)
Lat 90.0th-qrtle-319 2420.00 ( 0.00%) 2244.00 ( 7.27%)
Lat 99.0th-qrtle-319 4552.00 ( 0.00%) 4456.00 ( 2.11%)
Lat 99.9th-qrtle-319 6072.00 ( 0.00%) 7656.00 ( -26.09%)*
Lat 20.0th-qrtle-319 47936.00 ( 0.00%) 47808.00 ( 0.27%)

We can see that when the system is under-loaded, the 99.9th percentile
wakeup latency improves. But when the system gets busier, say from 8 to
319 threads, the wakeup latency suffers.

The following change, intended to avoid task migration/stacking, could mitigate the issue: stay on prev_cpu if it already shares the cache with the preferred CPU, and otherwise pick a CPU within the preferred LLC rather than the single preferred CPU itself:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cddd67100a91..a492463aed71 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8801,6 +8801,7 @@ static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int
 static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct mm_struct *mm = p->mm;
+	struct sched_domain *sd;
 	int cpu;
 
 	if (!sched_feat(SCHED_CACHE))
@@ -8813,6 +8814,8 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	if (cpu < 0)
 		return prev_cpu;
 
+	if (cpus_share_cache(prev_cpu, cpu))
+		return prev_cpu;
 
 	if (static_branch_likely(&sched_numa_balancing) &&
 	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
@@ -8822,6 +8825,10 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 		return prev_cpu;
 	}
 
+	sd = rcu_dereference(per_cpu(sd_llc, cpu));
+	if (likely(sd))
+		return cpumask_any(sched_domain_span(sd));
+
 	return cpu;
 }

                              BASELINE_sc           SCHED_CACHE_sc
Lat 50.0th-qrtle-1 5.00 ( 0.00%) 5.00 ( 0.00%)
Lat 90.0th-qrtle-1 8.00 ( 0.00%) 5.00 ( 37.50%)
Lat 99.0th-qrtle-1 10.00 ( 0.00%) 10.00 ( 0.00%)
Lat 99.9th-qrtle-1 20.00 ( 0.00%) 20.00 ( 0.00%)*
Lat 20.0th-qrtle-1 409.00 ( 0.00%) 406.00 ( 0.73%)
Lat 50.0th-qrtle-2 8.00 ( 0.00%) 4.00 ( 50.00%)
Lat 90.0th-qrtle-2 11.00 ( 0.00%) 5.00 ( 54.55%)
Lat 99.0th-qrtle-2 16.00 ( 0.00%) 11.00 ( 31.25%)
Lat 99.9th-qrtle-2 29.00 ( 0.00%) 16.00 ( 44.83%)*
Lat 20.0th-qrtle-2 819.00 ( 0.00%) 825.00 ( -0.73%)
Lat 50.0th-qrtle-4 10.00 ( 0.00%) 4.00 ( 60.00%)
Lat 90.0th-qrtle-4 12.00 ( 0.00%) 4.00 ( 66.67%)
Lat 99.0th-qrtle-4 18.00 ( 0.00%) 6.00 ( 66.67%)
Lat 99.9th-qrtle-4 30.00 ( 0.00%) 15.00 ( 50.00%)*
Lat 20.0th-qrtle-4 1658.00 ( 0.00%) 1670.00 ( -0.72%)
Lat 50.0th-qrtle-8 9.00 ( 0.00%) 3.00 ( 66.67%)
Lat 90.0th-qrtle-8 11.00 ( 0.00%) 4.00 ( 63.64%)
Lat 99.0th-qrtle-8 16.00 ( 0.00%) 6.00 ( 62.50%)
Lat 99.9th-qrtle-8 29.00 ( 0.00%) 13.00 ( 55.17%)*
Lat 20.0th-qrtle-8 3308.00 ( 0.00%) 3340.00 ( -0.97%)
Lat 50.0th-qrtle-16 9.00 ( 0.00%) 4.00 ( 55.56%)
Lat 90.0th-qrtle-16 12.00 ( 0.00%) 4.00 ( 66.67%)
Lat 99.0th-qrtle-16 18.00 ( 0.00%) 6.00 ( 66.67%)
Lat 99.9th-qrtle-16 31.00 ( 0.00%) 12.00 ( 61.29%)*
Lat 20.0th-qrtle-16 6616.00 ( 0.00%) 6680.00 ( -0.97%)
Lat 50.0th-qrtle-32 8.00 ( 0.00%) 4.00 ( 50.00%)
Lat 90.0th-qrtle-32 11.00 ( 0.00%) 5.00 ( 54.55%)
Lat 99.0th-qrtle-32 17.00 ( 0.00%) 8.00 ( 52.94%)
Lat 99.9th-qrtle-32 27.00 ( 0.00%) 11.00 ( 59.26%)*
Lat 20.0th-qrtle-32 13296.00 ( 0.00%) 13328.00 ( -0.24%)
Lat 50.0th-qrtle-64 9.00 ( 0.00%) 46.00 (-411.11%)
Lat 90.0th-qrtle-64 14.00 ( 0.00%) 1198.00 (-8457.14%)
Lat 99.0th-qrtle-64 20.00 ( 0.00%) 2252.00 (-11160.00%)
Lat 99.9th-qrtle-64 31.00 ( 0.00%) 2844.00 (-9074.19%)*
Lat 20.0th-qrtle-64 26528.00 ( 0.00%) 15504.00 ( 41.56%)
Lat 50.0th-qrtle-128 7.00 ( 0.00%) 26.00 (-271.43%)
Lat 90.0th-qrtle-128 11.00 ( 0.00%) 2244.00 (-20300.00%)
Lat 99.0th-qrtle-128 17.00 ( 0.00%) 4488.00 (-26300.00%)
Lat 99.9th-qrtle-128 27.00 ( 0.00%) 5752.00 (-21203.70%)*
Lat 20.0th-qrtle-128 53184.00 ( 0.00%) 24544.00 ( 53.85%)
Lat 50.0th-qrtle-256 172.00 ( 0.00%) 135.00 ( 21.51%)
Lat 90.0th-qrtle-256 2084.00 ( 0.00%) 2022.00 ( 2.98%)
Lat 99.0th-qrtle-256 2780.00 ( 0.00%) 3908.00 ( -40.58%)
Lat 99.9th-qrtle-256 4536.00 ( 0.00%) 5832.00 ( -28.57%)*
Lat 20.0th-qrtle-256 53568.00 ( 0.00%) 51904.00 ( 3.11%)
Lat 50.0th-qrtle-319 369.00 ( 0.00%) 358.00 ( 2.98%)
Lat 90.0th-qrtle-319 2428.00 ( 0.00%) 2436.00 ( -0.33%)
Lat 99.0th-qrtle-319 4552.00 ( 0.00%) 4664.00 ( -2.46%)
Lat 99.9th-qrtle-319 6104.00 ( 0.00%) 6632.00 ( -8.65%)*
Lat 20.0th-qrtle-319 48192.00 ( 0.00%) 48832.00 ( -1.33%)


We can see the wakeup latency improve over a wider range of thread counts. But there is still a regression starting from 64 threads - maybe the benefit of LLC locality is offset by task stacking on one LLC. One possible direction I'm thinking of: take a snapshot of the LLC status during load balancing and check whether the LLC is overloaded; if it is, do not enable the LLC aggregation during task wakeup, but leave it to the load balancer, which runs less frequently.
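Something like the following is what I have in mind - only a rough sketch to illustrate the idea; llc_overloaded and update_llc_overloaded() are made-up names and not part of the patch, and the "more runnable tasks than CPUs" heuristic is just a placeholder:

/*
 * Sketch only, not part of the posted patch: per-CPU snapshot of
 * whether this CPU's LLC looks overloaded.
 */
static DEFINE_PER_CPU(bool, llc_overloaded);

/*
 * Called from the periodic load balancer (rcu read lock held);
 * refreshes the snapshot for every CPU in @cpu's LLC.
 */
static void update_llc_overloaded(int cpu)
{
	struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));
	unsigned int nr_running = 0;
	bool overloaded;
	int i;

	if (!sd)
		return;

	/* Count runnable tasks across the whole LLC span. */
	for_each_cpu(i, sched_domain_span(sd))
		nr_running += cpu_rq(i)->nr_running;

	/* Treat the LLC as overloaded if runnable tasks exceed CPUs. */
	overloaded = nr_running > sd->span_weight;

	for_each_cpu(i, sched_domain_span(sd))
		WRITE_ONCE(per_cpu(llc_overloaded, i), overloaded);
}

select_cache_cpu() could then bail out early when the snapshot says the preferred LLC is overloaded:

	/* Before steering the wakeup towards the preferred LLC. */
	if (READ_ONCE(per_cpu(llc_overloaded, cpu)))
		return prev_cpu;

The snapshot would be stale between load-balance intervals, but the wakeup path only reads a per-CPU bool, so the extra cost at wakeup stays negligible.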

thanks,
Chenyu