Re: [Patch v4 17/22] sched/cache: Avoid cache-aware scheduling for memory-heavy processes

From: Chen, Yu C

Date: Fri Apr 10 2026 - 05:00:16 EST


Hi Peter,

On 4/9/2026 8:46 PM, Peter Zijlstra wrote:
> On Wed, Apr 01, 2026 at 02:52:29PM -0700, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@xxxxxxxxx>
>>
>> Prateek and Tingyin reported that memory-intensive workloads (such as
>> stream) can saturate memory bandwidth and caches on the preferred LLC
>> when sched_cache aggregates too many threads.
>>
>> To mitigate this, estimate a process's memory footprint by comparing
>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>> exceeds the LLC size, skip cache-aware scheduling.
>>
>> Note that RSS is only an approximation of the memory footprint.
>> By default, the comparison is strict, but a later patch will allow
>> users to provide a hint to adjust this threshold.
>>
>> According to testing by Adam, some systems have no shared L3 but do
>> have shared L2 clusters. In that case, the L2 becomes the LLC[1].
>
> This is pretty terrible. If you want LLC size, add it to the topology
> information (and ideally integrate with RDT) and make it proportional
> to the cpumask size, such that if someone cuts the domain into pieces,
> they get a proportional size, etc.

If I understand correctly, do you mean the following:

1. Introduce a generic arch_get_llc_size() as a wrapper around the
   existing get_cpu_cacheinfo_level(), which returns llc_size. Both
   the scheduler and RDT can use arch_get_llc_size().
2. Have the sched domain store the scaled size as
   sd->res_size = llc_size * sd_span / arch_llc_span,
   and let the cache-aware scheduler use sd->res_size for the
   comparison.

We will adjust the code accordingly.

> Also, if we have NUMA_BALANCING on, that can provide a much better
> estimate for the actual size.
>
> Just using RSS seems like a very bad metric here.


Got it. Currently the kernel lacks an accurate memory-footprint
metric. If we support user-provided hints in the future, we could
leverage the RDT llc_occupancy metric (is it legal to use RDT's
metrics directly in the kernel? It would switch from an MSR read
to an MMIO read and thus have lower overhead). For now, let me try
leveraging the NUMA fault-in stats. If NUMA balancing is off, I
need to think more about how to avoid over-aggregation for
memory-intensive workloads.

thanks,
Chenyu