Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes
From: Chen, Yu C
Date: Fri Dec 19 2025 - 07:56:06 EST
On 12/19/2025 11:14 AM, K Prateek Nayak wrote:
Hello Vern,
On 12/18/2025 3:12 PM, Vern Hao wrote:
On 2025/12/18 16:32, Chen, Yu C wrote:
On 12/18/2025 11:59 AM, Vern Hao wrote:
RSS size and bandwidth saturation are not necessarily linked. In my view, the optimization should be robust enough that it doesn't cause a noticeable drop in performance, no matter how large the RSS is.
On 2025/12/4 07:07, Tim Chen wrote:
From: Chen Yu <yu.c.chen@xxxxxxxxx>
Restricting RSS prevents many applications from benefiting from this optimization. I believe this restriction should be lifted. For memory-intensive workloads, the optimization may simply yield no gains, but it certainly shouldn't make performance worse. We need to further refine this logic.
Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate memory bandwidth and caches on the preferred LLC
when sched_cache aggregates too many threads.
To mitigate this, estimate a process's memory footprint by comparing
its RSS (anonymous and shared pages) to the size of the LLC. If RSS
exceeds the LLC size, skip cache-aware scheduling.
Memory-intensive workloads may trigger performance regressions when
memory bandwidth (from the L3 cache to the memory controller) is saturated due
to too many threads being aggregated on the preferred LLC.
Easier said than done. I agree RSS size is not a clear indication of
bandwidth saturation. With NUMA Balancing enabled, we can use the
hinting faults to estimate the working set and make decisions but for
systems that do not have NUMA, short of programming some performance
counters, there is no real way to estimate the working set.
Hinting faults are known to be costly, so enabling them on systems
without NUMA would add noticeable overhead with no real benefit.
We need to have a more profound discussion on this.
What do you have in mind?
From where I stand, having the RSS-based bailout for now won't make
things worse for these tasks with huge memory reserves, and once we can
all agree on some generic method to estimate the working set of a task,
we can always add it into exceed_llc_capacity().
Prateek, thanks very much for the practical callouts - using RSS seems to be
the best trade-off we can go with for now. Vern, I get your point about the
gap between RSS and the actual memory footprint. However, detecting the
working set accurately and generically in kernel space doesn't seem
feasible - even with NUMA fault statistics sampling. One reliable way I can
think of to detect the working set is in user space, via resctrl (Intel RDT,
AMD QoS, Arm MPAM). So maybe we can leverage that information to implement
fine-grained control on a per-process or per-task basis later.
thanks,
Chenyu