Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width

From: Dietmar Eggemann

Date: Tue Apr 07 2026 - 10:43:32 EST


On 07.04.26 08:39, Zhang Qiao wrote:
> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
> waker/wakee relationships and to disable wake_affine() for those cases.
>
> On SMT systems, sd_llc_size counts logical CPUs rather than physical
> cores. This inflates the wake_wide() threshold, allowing wake_affine()
> to pack more tasks into one LLC domain than the actual compute capacity
> of its physical cores can sustain. The resulting SMT interference may
> cost more than the cache-locality benefit wake_affine() intends to gain.
>
> Scale the factor by the SMT width of the current CPU so that it
> approximates the number of independent physical cores in the LLC domain,
> making wake_wide() more likely to kick in before SMT interference
> becomes significant. On non-SMT systems the SMT width is 1 and behaviour
> is unchanged.
>
> Signed-off-by: Zhang Qiao <zhangqiao22@xxxxxxxxxx>
> ---
> kernel/sched/fair.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f07df8987a5ef..4896582c6e904 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
> unsigned int slave = p->wakee_flips;
> int factor = __this_cpu_read(sd_llc_size);
>
> + /* Scale factor to physical-core count to account for SMT interference. */
> + if (sched_smt_active())
> + factor = DIV_ROUND_UP(factor,
> + cpumask_weight(cpu_smt_mask(smp_processor_id())));
> +
> if (master < slave)
> swap(master, slave);
> if (slave < factor || master < slave * factor)

I assume not a lot of people care, since for a wake_wide() decision to
actually take effect (i.e. reach the slow wakeup path) this also needs:

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5847b83d9d55..596c5d590532 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1691,7 +1691,7 @@ sd_init(struct sched_domain_topology_level *tl,
.flags = 1*SD_BALANCE_NEWIDLE
| 1*SD_BALANCE_EXEC
| 1*SD_BALANCE_FORK
- | 0*SD_BALANCE_WAKE
+ | 1*SD_BALANCE_WAKE
| 1*SD_WAKE_AFFINE
| 0*SD_SHARE_CPUCAPACITY
| 0*SD_SHARE_LLC

And then it's a trade-off between one busy thread per core vs. wakeup cost.