Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes

From: Vern Hao

Date: Sun Dec 21 2025 - 21:19:44 EST



On 2025/12/19 11:14, K Prateek Nayak wrote:
Hello Vern,

On 12/18/2025 3:12 PM, Vern Hao wrote:
On 2025/12/18 16:32, Chen, Yu C wrote:
On 12/18/2025 11:59 AM, Vern Hao wrote:
On 2025/12/4 07:07, Tim Chen wrote:
From: Chen Yu <yu.c.chen@xxxxxxxxx>

Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate memory bandwidth and caches on the preferred LLC
when sched_cache aggregates too many threads.

To mitigate this, estimate a process's memory footprint by comparing
its RSS (anonymous and shared pages) to the size of the LLC. If RSS
exceeds the LLC size, skip cache-aware scheduling.
Restricting by RSS prevents many applications from benefiting from this optimization. I believe this restriction should be lifted. For memory-intensive workloads, the optimization may simply yield no gains, but it certainly shouldn't make performance worse. We need to further refine this logic.
Memory-intensive workloads may trigger performance regressions when
memory bandwidth (from the L3 cache to the memory controller) is saturated due
RSS size and bandwidth saturation are not necessarily linked. In my view, the optimization should be robust enough that it doesn't cause a noticeable drop in performance, no matter how large the RSS is.
Easier said than done. I agree RSS size is not a clear indication of
bandwidth saturation. With NUMA Balancing enabled, we can use the
hinting faults to estimate the working set and make decisions but for
systems that do not have NUMA, short of programming some performance
counters, there is no real way to estimate the working set.
I see the challenge, but the reality is that many production workloads have large memory footprints and deserve to see performance gains as well. In my testing with Chen Yu on STREAM, it is intriguing that performance is fine without llc_enable but drops significantly once it is turned on. I sincerely hope this situation can be optimized; otherwise, we won't be able to use these optimizations in large-memory scenarios.

Hinting faults are known to add overhead, so enabling them on systems
without NUMA incurs a noticeable cost with no real benefit.

We need to have a more profound discussion on this.
What do you have in mind?
I am wondering if we could address this through alternative approaches, such as reducing the migration frequency or preventing excessive task stacking within a single LLC. Of course, defining the right metrics to evaluate these conditions remains a significant challenge.

From where I stand, keeping the RSS-based bailout for now won't make
things worse for these tasks with huge memory reserves, and once we can
all agree on a generic method to estimate the working set of a task,
we can always add it to exceed_llc_capacity().