Re: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated

From: Chen, Yu C
Date: Thu Apr 24 2025 - 10:12:43 EST


Hi Madadi,

On 4/24/2025 5:22 PM, Madadi Vineeth Reddy wrote:
Hi Chen Yu,

On 21/04/25 08:55, Chen Yu wrote:
It is found that when the process's preferred LLC gets saturated by too many
threads, task contention becomes very frequent and causes performance regression.

Save the per-LLC statistics calculated by periodic load balance. The statistics
include the average utilization and the average number of runnable tasks.
The task wakeup path for cache-aware scheduling consults these statistics and
inhibits the cache-aware wakeup to avoid performance regression: when either
the average utilization of the preferred LLC has reached 25%, or the average
number of runnable tasks has exceeded 1/3 of the LLC weight, the cache-aware
wakeup is disabled. This restriction only applies when the process has more
threads than the LLC weight.
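
As a rough illustration of the bookkeeping (a minimal sketch, assuming the
LLC statistics are cached in the LLC-shared state; the "util" and "nr_running"
fields below are placeholders for this example, not the exact fields added by
the patch), the wakeup-side read could look like:

static bool get_llc_stats(int cpu, int *nr_running, int *llc_weight,
			  unsigned long *util)
{
	struct sched_domain_shared *sd_share;

	sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
	if (!sd_share)
		return false;

	/* averages written by periodic load balance (placeholder fields) */
	*nr_running = READ_ONCE(sd_share->nr_running);
	*util = READ_ONCE(sd_share->util);
	*llc_weight = per_cpu(sd_llc_size, cpu);

	return true;
}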

Running schbench via mmtests on a Xeon platform with 2 sockets, each having
60 cores/120 CPUs. DRAM interleaving is enabled across NUMA nodes via the BIOS,
so there are 2 "LLCs" in 1 NUMA node.

compare-mmtests.pl --directory work/log --benchmark schbench --names baseline,sched_cache
baseline sched_cache
Lat 50.0th-qrtle-1 6.00 ( 0.00%) 6.00 ( 0.00%)
Lat 90.0th-qrtle-1 10.00 ( 0.00%) 9.00 ( 10.00%)
Lat 99.0th-qrtle-1 29.00 ( 0.00%) 13.00 ( 55.17%)
Lat 99.9th-qrtle-1 35.00 ( 0.00%) 21.00 ( 40.00%)
Lat 20.0th-qrtle-1 266.00 ( 0.00%) 266.00 ( 0.00%)
Lat 50.0th-qrtle-2 8.00 ( 0.00%) 6.00 ( 25.00%)
Lat 90.0th-qrtle-2 10.00 ( 0.00%) 10.00 ( 0.00%)
Lat 99.0th-qrtle-2 19.00 ( 0.00%) 18.00 ( 5.26%)
Lat 99.9th-qrtle-2 27.00 ( 0.00%) 29.00 ( -7.41%)
Lat 20.0th-qrtle-2 533.00 ( 0.00%) 507.00 ( 4.88%)
Lat 50.0th-qrtle-4 6.00 ( 0.00%) 5.00 ( 16.67%)
Lat 90.0th-qrtle-4 8.00 ( 0.00%) 5.00 ( 37.50%)
Lat 99.0th-qrtle-4 14.00 ( 0.00%) 9.00 ( 35.71%)
Lat 99.9th-qrtle-4 22.00 ( 0.00%) 14.00 ( 36.36%)
Lat 20.0th-qrtle-4 1070.00 ( 0.00%) 995.00 ( 7.01%)
Lat 50.0th-qrtle-8 5.00 ( 0.00%) 5.00 ( 0.00%)
Lat 90.0th-qrtle-8 7.00 ( 0.00%) 5.00 ( 28.57%)
Lat 99.0th-qrtle-8 12.00 ( 0.00%) 11.00 ( 8.33%)
Lat 99.9th-qrtle-8 19.00 ( 0.00%) 16.00 ( 15.79%)
Lat 20.0th-qrtle-8 2140.00 ( 0.00%) 2140.00 ( 0.00%)
Lat 50.0th-qrtle-16 6.00 ( 0.00%) 5.00 ( 16.67%)
Lat 90.0th-qrtle-16 7.00 ( 0.00%) 5.00 ( 28.57%)
Lat 99.0th-qrtle-16 12.00 ( 0.00%) 10.00 ( 16.67%)
Lat 99.9th-qrtle-16 17.00 ( 0.00%) 14.00 ( 17.65%)
Lat 20.0th-qrtle-16 4296.00 ( 0.00%) 4200.00 ( 2.23%)
Lat 50.0th-qrtle-32 6.00 ( 0.00%) 5.00 ( 16.67%)
Lat 90.0th-qrtle-32 8.00 ( 0.00%) 6.00 ( 25.00%)
Lat 99.0th-qrtle-32 12.00 ( 0.00%) 10.00 ( 16.67%)
Lat 99.9th-qrtle-32 17.00 ( 0.00%) 14.00 ( 17.65%)
Lat 20.0th-qrtle-32 8496.00 ( 0.00%) 8528.00 ( -0.38%)
Lat 50.0th-qrtle-64 6.00 ( 0.00%) 5.00 ( 16.67%)
Lat 90.0th-qrtle-64 8.00 ( 0.00%) 8.00 ( 0.00%)
Lat 99.0th-qrtle-64 12.00 ( 0.00%) 12.00 ( 0.00%)
Lat 99.9th-qrtle-64 17.00 ( 0.00%) 17.00 ( 0.00%)
Lat 20.0th-qrtle-64 17120.00 ( 0.00%) 17120.00 ( 0.00%)
Lat 50.0th-qrtle-128 7.00 ( 0.00%) 7.00 ( 0.00%)
Lat 90.0th-qrtle-128 9.00 ( 0.00%) 9.00 ( 0.00%)
Lat 99.0th-qrtle-128 13.00 ( 0.00%) 14.00 ( -7.69%)
Lat 99.9th-qrtle-128 20.00 ( 0.00%) 20.00 ( 0.00%)
Lat 20.0th-qrtle-128 31776.00 ( 0.00%) 30496.00 ( 4.03%)
Lat 50.0th-qrtle-239 9.00 ( 0.00%) 9.00 ( 0.00%)
Lat 90.0th-qrtle-239 14.00 ( 0.00%) 18.00 ( -28.57%)
Lat 99.0th-qrtle-239 43.00 ( 0.00%) 56.00 ( -30.23%)
Lat 99.9th-qrtle-239 106.00 ( 0.00%) 483.00 (-355.66%)
Lat 20.0th-qrtle-239 30176.00 ( 0.00%) 29984.00 ( 0.64%)

We can see overall latency improvement and some throughput degradation
when the system gets saturated.

Also, we ran schbench (an older version) on an EPYC 7543 system, which has
4 NUMA nodes, each with 4 LLCs, and monitored the 99.0th percentile latency:

case load baseline(std%) compare%( std%)
normal 4-mthreads-1-workers 1.00 ( 6.47) +9.02 ( 4.68)
normal 4-mthreads-2-workers 1.00 ( 3.25) +28.03 ( 8.76)
normal 4-mthreads-4-workers 1.00 ( 6.67) -4.32 ( 2.58)
normal 4-mthreads-8-workers 1.00 ( 2.38) +1.27 ( 2.41)
normal 4-mthreads-16-workers 1.00 ( 5.61) -8.48 ( 4.39)
normal 4-mthreads-31-workers 1.00 ( 9.31) -0.22 ( 9.77)

When the LLC is underloaded, a latency improvement is observed. When the LLC
gets saturated, we observe some degradation.


[..snip..]

+static bool valid_target_cpu(int cpu, struct task_struct *p)
+{
+ int nr_running, llc_weight;
+ unsigned long util, llc_cap;
+
+ if (!get_llc_stats(cpu, &nr_running, &llc_weight,
+ &util))
+ return false;
+
+ llc_cap = llc_weight * SCHED_CAPACITY_SCALE;
+
+ /*
+ * If this process has many threads, be careful to avoid
+ * task stacking on the preferred LLC by checking the LLC's
+ * utilization and number of runnable tasks. Otherwise, if
+ * this process does not have many threads, honor the
+ * cache aware wakeup.
+ */
+ if (get_nr_threads(p) < llc_weight)
+ return true;

IIUC, there might be scenarios where the LLC might already be overloaded with
threads of other processes. In that case, we will return true for p in the
above condition and not check the conditions below. Shouldn't we check the
two conditions below either way?

The reason why get_nr_threads() was used is that we don't know whether the thresholds below are suitable for all workloads. We chose 25% and 33% because they worked well for workload A but were too low for workload B. Workload B requires cache-aware scheduling to remain enabled in any case, and its number of threads is smaller than the llc_weight, so the get_nr_threads() check was added to meet B's requirement. That said, what you said is correct: we can remove the check on nr_threads, make the combination of utilization and nr_running a mandatory check, and then do further tuning.
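
If we go that route, the check would reduce to something like the sketch below
(a sketch only: it keeps the current 25%/33% divisors and drops the
get_nr_threads() bypass; whether the divisors stay hard-coded or become
tunable is part of the further tuning):

static bool valid_target_cpu(int cpu, struct task_struct *p)
{
	int nr_running, llc_weight;
	unsigned long util, llc_cap;

	if (!get_llc_stats(cpu, &nr_running, &llc_weight, &util))
		return false;

	llc_cap = llc_weight * SCHED_CAPACITY_SCALE;

	/* average utilization has reached 25% of the LLC capacity */
	if (util * 4 >= llc_cap)
		return false;

	/* average nr of runnable tasks has exceeded 1/3 of the LLC weight */
	if (nr_running * 3 >= llc_weight)
		return false;

	return true;
}

For example, on a hypothetical 32-CPU LLC, llc_cap = 32 * 1024 = 32768, so the
utilization check trips once util reaches 8192 (25%), and the runnable check
trips once nr_running reaches 11.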
Tested this patch with the real-life workload Daytrader and didn't see any regression.

Good to know the regression is gone.

It spawns a lot of threads and is CPU intensive, so I think it's not impacted
by the conditions below.

Also, in the schbench numbers you provided, there is a degradation in the
saturated case. Is it due to the overhead of computing the preferred LLC,
which then isn't used because of the conditions below?

Yes, the overhead of the preferred LLC calculation could be one part of it, and we also suspect that the degradation might be tied to task migrations. We still observed more task migrations than the baseline, even when the system was saturated (in theory, once 25% is exceeded, we should fall back to the generic task wakeup path). We haven't dug into that yet and can investigate it in the following days.
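
To spell out the intended fallback (a sketch only; select_cache_cpu() and
get_preferred_llc_cpu() are placeholder names standing in for the cache-aware
selection and the per-mm preferred-LLC hint added earlier in the series):

static int select_cache_cpu(struct task_struct *p, int prev_cpu)
{
	/* placeholder helper: per-mm preferred LLC hint, or -1 if none */
	int pref_cpu = get_preferred_llc_cpu(p);

	if (pref_cpu < 0)
		return prev_cpu;

	/* preferred LLC saturated: let the generic wakeup path decide */
	if (!valid_target_cpu(pref_cpu, p))
		return prev_cpu;

	return pref_cpu;
}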

thanks,
Chenyu
Thanks,
Madadi Vineeth Reddy

+
+ /*
+ * Check if the average utilization exceeded 25% of the LLC
+ * capacity, or if the number of runnable tasks exceeded 1/3
+ * of the CPUs. These are magic numbers that did not cause
+ * heavy cache contention on Xeon or Zen.
+ */
+ if (util * 4 >= llc_cap)
+ return false;
+
+ if (nr_running * 3 >= llc_weight)
+ return false;
+
+ return true;
+}
+

[..snip..]