Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling

From: Chen, Yu C

Date: Wed Dec 24 2025 - 02:52:07 EST


On 12/24/2025 11:28 AM, Yangyu Chen wrote:


On 24 Dec 2025, at 00:44, Yangyu Chen <cyy@xxxxxxxxxxxx> wrote:

On 23 Dec 2025, at 20:12, Yangyu Chen <cyy@xxxxxxxxxxxx> wrote:

On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:

From: Chen Yu <yu.c.chen@xxxxxxxxx>

Introduce a set of debugfs knobs to control the enabling of
and parameters for cache-aware load balancing.

(1) llc_enabled
llc_enabled acts as the primary switch - users can toggle it to
enable or disable cache-aware load balancing.

(2) llc_aggr_tolerance
With sched_cache enabled, the scheduler uses a process's RSS as a
proxy for its LLC footprint to determine if aggregating tasks on the
preferred LLC could cause cache contention. If RSS exceeds the LLC
size, aggregation is skipped. Some workloads with large RSS but small
actual memory footprints may still benefit from aggregation. Since
the kernel cannot efficiently track per-task cache usage (resctrl is
user-space only), userspace can provide a more accurate hint.

Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
users control how strictly RSS limits aggregation. Values range from
0 to 100:

- 0: Cache-aware scheduling is disabled.
- 1: Strict; tasks with RSS larger than LLC size are skipped.
- 100: Aggressive; tasks are aggregated regardless of RSS.


Hi Chen Yu and Tim Chen,

Maybe we should have something like prctl(PR_LLC_AGGR_TOLERANCE, 100).

I have tested this version of the patch on my EPYC Milan 7V13 (7763 variant) physical machine, which has a 32M LLC per 8-core CCX. I found that I need to set "llc_aggr_tolerance" to 100; otherwise, I can't get cache-aware scheduling to work on Verilated [1] XiangShan [2] running chacha20 [3], as I mentioned before [4].


In addition, I have investigated why this happens. I finally
realized it is because the workload shows 35596 kB of RssAnon on
my EPYC Milan machine, slightly exceeding the LLC size (32M). I
tested it on an EPYC Genoa cloud server with the correct core /
cache hierarchy in the ACPI table, where it shows 31700 kB of
RssAnon and thus fits in the LLC. I have no idea why my result
shows higher RssAnon, since both machines run Debian Trixie with
the exact same kernel and the same executable. But it reminds me
that we should have a userspace API for this.


In addition, while profiling the Verilator model, I found that
scheduling onto SMT siblings results in poor performance. Thus, I
think we should separate the RSS-size control from the SMT scaling.


Thanks for the investigation. Could you elaborate a little more on
"scheduled to SMTs"? Do you mean that if every CPU (SMT sibling) in the
LLC has 1 running task, the performance is impacted? I thought we had
exceed_llc_nr() to check the SMT count to avoid this?

It's notable that RSS is not the actual memory footprint. It
would be better if we could count l2_miss or l3_miss events to
estimate the L3 hit rate. Just for future work.


Yes, in user space we can collect PMU events / memory bandwidth via
resctrl to decide whether to set task attributes.

I'm willing to provide a patch for such a prctl. But I'm busy these
days; maybe I can find the time to do that after one week.


Sure. We haven't yet decided which interface to leverage.
Also, Qais is working on a QoS interface [1] - maybe we can build
on his work.

[1] https://lore.kernel.org/all/20240820163512.1096301-11-qyousef@xxxxxxxxxxx/

thanks,
Chenyu