Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width
From: Shrikanth Hegde
Date: Tue Apr 07 2026 - 14:17:41 EST
On 4/7/26 8:08 PM, Dietmar Eggemann wrote:
> On 07.04.26 08:39, Zhang Qiao wrote:
>> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
>> waker/wakee relationships and to disable wake_affine() for those cases.
>> On SMT systems, sd_llc_size counts logical CPUs rather than physical
>> cores. This inflates the wake_wide() threshold, allowing wake_affine()
>> to pack more tasks into one LLC domain than the actual compute capacity
>> of its physical cores can sustain. The resulting SMT interference may
>> cost more than the cache-locality benefit wake_affine() intends to gain.
>>
>> Scale the factor by the SMT width of the current CPU so that it
>> approximates the number of independent physical cores in the LLC domain,
>> making wake_wide() more likely to kick in before SMT interference
>> becomes significant. On non-SMT systems the SMT width is 1 and behaviour
>> is unchanged.
>> Signed-off-by: Zhang Qiao <zhangqiao22@xxxxxxxxxx>
>> ---
>>  kernel/sched/fair.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f07df8987a5ef..4896582c6e904 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
>>  	unsigned int slave = p->wakee_flips;
>>  	int factor = __this_cpu_read(sd_llc_size);
>>
>> +	/* Scale factor to physical-core count to account for SMT interference. */
>> +	if (sched_smt_active())
>> +		factor = DIV_ROUND_UP(factor,
>> +				      cpumask_weight(cpu_smt_mask(smp_processor_id())));
>> +
>>  	if (master < slave)
>>  		swap(master, slave);
>>  	if (slave < factor || master < slave * factor)
> I assume not a lot of people care since this needs:
The wake_affine machinery needs SD_WAKE_AFFINE. No?
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5847b83d9d55..596c5d590532 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1691,7 +1691,7 @@ sd_init(struct sched_domain_topology_level *tl,
>  	.flags = 1*SD_BALANCE_NEWIDLE
>  		 | 1*SD_BALANCE_EXEC
>  		 | 1*SD_BALANCE_FORK
> -		 | 0*SD_BALANCE_WAKE
> +		 | 1*SD_BALANCE_WAKE
>  		 | 1*SD_WAKE_AFFINE
>  		 | 0*SD_SHARE_CPUCAPACITY
>  		 | 0*SD_SHARE_LLC
>
> And then it's a trade-off between one busy thread per core vs. wakeup cost.