Re: [PATCH] sched/fair: Prefer cache-hot prev_cpu for wakeup

From: Shubhang

Date: Thu Oct 30 2025 - 12:35:30 EST


The system is an 80 core Ampere Altra with a two-level
sched domain topology. The MC domain contains all 80 cores.

I agree that placing the condition earlier in `select_idle_sibling()` aligns better with convention. I will move the EAS-aware check to the top of that function and submit a v2 patch.
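For reference, a rough sketch of where such a check might sit, as an early special case in `select_idle_sibling()`. This is a hypothetical illustration of the planned placement, not the v2 patch itself; the `sched_energy_enabled()` guard is an assumption about what "EAS-aware" will mean here:

```c
/*
 * Hypothetical placement sketch only (not the actual v2 patch):
 * the cache-locality condition moves to the top of
 * select_idle_sibling(), alongside its other early-return cases.
 */
static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
	/*
	 * Prefer a non-overutilized prev CPU for cache locality
	 * (guarding on EAS being enabled is an assumption here).
	 */
	if (sched_energy_enabled() && !cpu_overutilized(prev))
		return prev;

	/* ... existing idle-sibling search ... */
}
```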

Best,
Shubhang Kaushik

On Thu, 30 Oct 2025, Dietmar Eggemann wrote:

On 18.10.25 01:00, Shubhang Kaushik via B4 Relay wrote:
From: Shubhang Kaushik <shubhang@xxxxxxxxxxxxxxxxxxxxxx>

Modify the wakeup path in `select_task_rq_fair()` to prioritize cache
locality for waking tasks. The previous fast path always attempted to
find an idle sibling, even if the task's prev CPU was not truly busy.

The original problem was that under some circumstances, this could lead
to unnecessary task migrations away from a cache-hot core, even when
the task's prev CPU was a suitable candidate. The scheduler's internal
helper `cpu_overutilized()` provides an evaluation of CPU load.

To address this, the wakeup heuristic is updated to check the status of
the task's `prev_cpu` first:
- If the `prev_cpu` is not overutilized (as determined by
`cpu_overutilized()`, via PELT), the task is woken up on
its previous CPU. This leverages cache locality and avoids
a potentially unnecessary migration.
- If the `prev_cpu` is considered busy or overutilized, the scheduler
falls back to the existing behavior of searching for an idle sibling.

What does your sched domain topology look like? How many CPUs do you have
in your MC domain?


Signed-off-by: Shubhang Kaushik <shubhang@xxxxxxxxxxxxxxxxxxxxxx>
---
This patch optimizes the scheduler's wakeup path to prioritize cache
locality by keeping a task on its previous CPU if it is not overutilized,
falling back to a sibling search only when necessary.
---
kernel/sched/fair.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc0b7ce8a65d6bbe616953f530f7a02bb619537c..bb0d28d7d9872642cb5a4076caeb3ac9d8fe7bcd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8618,7 +8618,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
} else if (wake_flags & WF_TTWU) { /* XXX always ? */
/* Fast path */
- new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+
+ /*
+ * Avoid wakeup on an overutilized CPU.
+ * If the previous CPU is not overutilized, retain it for cache locality.
+ * Otherwise, search for an idle sibling.
+ */
+ if (!cpu_overutilized(prev_cpu))
+ new_cpu = prev_cpu;

IMHO, special conditions like this one are normally coded at the
beginning of select_idle_sibling().

[...]