Re: [RFC PATCH v4 26/28] sched: Do not enable cache aware scheduling for process with large RSS

From: Chen, Yu C

Date: Fri Sep 26 2025 - 10:31:16 EST

On 9/26/2025 4:48 PM, Adam Li wrote:

Hi Chen Yu,

Thanks for your work.
I tested the patch set on an AmpereOne CPU with 192 cores.

With CONFIG_SCHED_CLUSTER enabled, and with a certain firmware setting,
every eight cores are grouped into a 'cluster' schedule domain
with the 'SD_SHARE_LLC' flag.
However, these eight cores do *not* share an L3 cache in this setup.

In exceed_llc_capacity() of this patch, we have 'llc = l3_leaf->size',
so 'llc' will be zero if there is *no* L3 cache.
exceed_llc_capacity() will then return true and 'Cache Aware Scheduling'
will not work. Please see details below.

I read in patch 01/28 "sched: Cache aware load-balancing" [1],
Peter mentioned:
"It is an attempt at modelling cache affinity -- and while the patch
really only targets LLC, it could very well be extended to also apply to
clusters (L2). Specifically any case of multiple cache domains inside a
node".

Do you have any idea how we can apply the cache aware load-balancing
to clusters? The cores in the cluster may share L2 or LLC tags.

My understanding is that if there is no L3 cache, then the L2 becomes
the LLC. We don’t need to modify the code specific to L2-aware scheduling
because the L2 is now the last-level cache (LLC). However, as you observed,
there are some cases that need to be taken care of. For example, Patch 8
needs to be fixed so that it does not always retrieve the cache size of
L3.
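
One way to picture that fix is a lookup that picks the highest cache
level reporting a non-zero size instead of hard-coding index 3. The
following is only a userspace sketch, not the kernel change: struct
toy_cacheinfo, enum toy_cache_type and toy_llc_size() are hypothetical
names made up for illustration, loosely modelled on the per-CPU cache
leaf list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a per-CPU cache leaf. */
enum toy_cache_type { TOY_CACHE_INST, TOY_CACHE_DATA, TOY_CACHE_UNIFIED };

struct toy_cacheinfo {
	enum toy_cache_type type;
	unsigned int level;	/* 1 = L1, 2 = L2, 3 = L3 */
	unsigned int size;	/* bytes; 0 means absent or not reported */
};

/*
 * Return the size of the highest-level data or unified cache that
 * reports a non-zero size, or 0 if no such leaf exists.
 */
static unsigned int toy_llc_size(const struct toy_cacheinfo *leaves,
				 size_t num_leaves)
{
	unsigned int best_level = 0, best_size = 0;
	size_t i;

	assert(leaves != NULL || num_leaves == 0);

	for (i = 0; i < num_leaves; i++) {
		const struct toy_cacheinfo *ci = &leaves[i];

		/* Instruction caches and empty leaves never count as LLC. */
		if (ci->type == TOY_CACHE_INST || ci->size == 0)
			continue;
		if (ci->level > best_level) {
			best_level = ci->level;
			best_size = ci->size;
		}
	}
	return best_size;
}
```

With leaves for L1d, L1i, a 2 MiB L2 and an L3 of size 0, this returns
the L2 size; once the L3 reports a size, the L3 wins.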

On the other hand, if the system has both an L2 cluster and an L3, the
code might need to be changed if we want to perform L2 cache aggregation
rather than L3 cache aggregation.

[1]: https://lore.kernel.org/all/9157186cf9e3fd541f62c637579ff736b3704c51.1754712565.git.tim.c.chen@xxxxxxxxxxxxxxx/

On 8/9/2025 1:08 PM, Chen Yu wrote:

+
+ l3_leaf = this_cpu_ci->info_list + 3;
+ llc = l3_leaf->size;
+

For some arm64 CPU topologies, cores can be grouped into a 'cluster'.
Cores in a cluster may not share an L3 cache, so 'l3_leaf->size'
will be 0.

It looks like we assume the LLC is the L3 cache?

Right, but the LLC is not always the L3; this needs a fix.

Can we skip the exceed_llc_capacity() check if there is no L3?

I thought we should return the size of L2 instead, no?

thanks,
Chenyu

Like this draft patch:

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1227,6 +1227,8 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)

 	l3_leaf = this_cpu_ci->info_list + 3;
 	llc = l3_leaf->size;
+	if (!llc)
+		return false;
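
The two options discussed in this thread (skip the check when no L3
size is reported, or fall back to the L2 size) behave differently for a
large process. The sketch below models both in userspace. It assumes the
check compares the footprint in bytes against the cache size, which is
consistent with a zero 'llc' always yielding true; TOY_PAGE_SIZE and all
toy_* functions are hypothetical and not part of the patch set.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096UL	/* assumed page size for the model */

/* Assumed form of the check: cache size versus footprint in bytes. */
static bool toy_exceeds(unsigned long rss_pages, unsigned long cache_bytes)
{
	/* Guard the multiplication below against overflow. */
	assert(rss_pages <= ULONG_MAX / TOY_PAGE_SIZE);

	return cache_bytes <= rss_pages * TOY_PAGE_SIZE;
}

/* Draft-patch behaviour: no L3 size means the check is skipped. */
static bool toy_exceed_skip(unsigned long rss_pages, unsigned long l3_bytes)
{
	if (!l3_bytes)
		return false;
	return toy_exceeds(rss_pages, l3_bytes);
}

/* Fallback behaviour: no L3 size means the L2 is treated as the LLC. */
static bool toy_exceed_fallback(unsigned long rss_pages,
				unsigned long l2_bytes,
				unsigned long l3_bytes)
{
	unsigned long llc = l3_bytes ? l3_bytes : l2_bytes;

	return toy_exceeds(rss_pages, llc);
}
```

For a 4 MiB footprint on a system with a 2 MiB L2 and no L3, the skip
variant never reports the capacity as exceeded, while the fallback
variant does, so the large-RSS filter still applies relative to the L2.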