Re: [PATCH v2] sched/fair: Prefer cache locality for EAS wakeup
From: Shubhang Kaushik OS
Date: Fri Nov 14 2025 - 13:27:18 EST
Our current kernel has CONFIG_SCHED_CLUSTER enabled. Although it reports 2 NUMA nodes, only node0 is populated, with CPUs 0-79.
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-79
NUMA node1 CPU(s):
I run similar perf testcases along with MySQL and AI workloads.
> IMHO, the scheduler only cares about shared LLC (and shared L2 with
> CONFIG_SCHED_CLUSTER). Can you check:
$ cat /sys/devices/system/cpu/cpu0/cache/index*/{type,shared_cpu_map}
Data
Instruction
Unified
0000,00000000,00000001
0000,00000000,00000001
0000,00000000,00000001
The output confirms that the extra Unified cache entry (1) does not exist in our sysfs view.
Correct, this Altra machine has only a 2-CPU MC SD, which results in the small MC cpumask.
[
"cpu78",
{
"MC": "['78-79']",
"PKG": "['0-79']"
}
][
"cpu79",
{
"MC": "['78-79']",
"PKG": "['0-79']"
}
]
Thanks,
Shubhang Kaushik
________________________________________
From: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
Sent: Thursday, November 13, 2025 6:54 AM
To: Shubhang Kaushik OS; Vincent Guittot
Cc: Ingo Molnar; Peter Zijlstra; Juri Lelli; Steven Rostedt; Ben Segall; Mel Gorman; Valentin Schneider; Shubhang Kaushik; Shijie Huang; Frank Wang; Christopher Lameter; Adam Li; linux-kernel@xxxxxxxxxxxxxxx
Subject: Re: [PATCH v2] sched/fair: Prefer cache locality for EAS wakeup
On 13.11.25 01:26, Shubhang Kaushik OS wrote:
>> From your previous answer on v1, I don't think that you use
>> heterogeneous system so eas will not be enabled in your case and even
>> when used find_energy_efficient_cpu() will be called before
>
> I agree that the EAS centric approach in the current patch is misplaced for our homogeneous systems.
>
>> Otherwise you might want to check in wake_affine() where we decide
>> between local cpu and previous cpu which one should be the target.
>> This can have an impact especially if there are not in the same LLC
>
> While wake_affine() modifications seem logical, I see that they cause performance regressions across the board due to the inherent trade-offs in altering that critical initial decision point.
Which testcases are you running on your Altra box? I assume it's a
single NUMA node (80 CPUs).
For us, `perf bench sched messaging` w/o CONFIG_SCHED_CLUSTER, i.e. only a
PKG SD (so sis() only returns prev or this CPU), gives better results
than w/ CONFIG_SCHED_CLUSTER.
> We might need to solve the non-idle fallback within `select_idle_sibling` to ring fence the impact for preserving locality effectively.
IMHO, the scheduler only cares about shared LLC (and shared L2 with
CONFIG_SCHED_CLUSTER). Can you check:
$ cat /sys/devices/system/cpu/cpu0/cache/index*/{type,shared_cpu_map}
Data
Instruction
Unified
Unified <-- (1)
00000000,00000000,00000000,00000000,00000001
00000000,00000000,00000000,00000000,00000001
00000000,00000000,00000000,00000000,00000001
00000000,00000000,00000000,00000000,00000001 <-- (1)
Does (1) exist? IMHO it doesn't.
I assume your machine is quite unique here. IIRC, you configure 2-CPU
groups in your ACPI PPTT, which then form a 2-CPU cluster_cpumask, and
since your core_mask (in cpu_coregroup_mask()) has only 1 CPU, it gets
set to the cluster_cpumask. So in the end you have a 2-CPU MC SD and no
CLS SD, plus an 80-CPU PKG SD.
This CLS->MC propagation is somewhat important, since only then do you
get a valid 'sd = rcu_dereference(per_cpu(sd_llc, target))' in sis(), so
you don't just return target (prev or this CPU).
But I can imagine that your MC cpumask is way too small for the SIS_UTIL
based selection of an idle CPU.
[...]