Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width

From: Dietmar Eggemann

Date: Wed Apr 22 2026 - 09:33:37 EST


On 07.04.26 20:16, Shrikanth Hegde wrote:
>
>
> On 4/7/26 8:08 PM, Dietmar Eggemann wrote:
>> On 07.04.26 08:39, Zhang Qiao wrote:
>>> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
>>> waker/wakee relationships and to disable wake_affine() for those cases.
>>>
>>> On SMT systems, sd_llc_size counts logical CPUs rather than physical
>>> cores. This inflates the wake_wide() threshold, allowing wake_affine()
>>> to pack more tasks into one LLC domain than the actual compute capacity
>>> of its physical cores can sustain. The resulting SMT interference may
>>> cost more than the cache-locality benefit wake_affine() intends to gain.
>>>
>>> Scale the factor by the SMT width of the current CPU so that it
>>> approximates the number of independent physical cores in the LLC domain,
>>> making wake_wide() more likely to kick in before SMT interference
>>> becomes significant. On non-SMT systems the SMT width is 1 and behaviour
>>> is unchanged.
>>>
>>> Signed-off-by: Zhang Qiao <zhangqiao22@xxxxxxxxxx>
>>> ---
>>>   kernel/sched/fair.c | 5 +++++
>>>   1 file changed, 5 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index f07df8987a5ef..4896582c6e904 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
>>>       unsigned int slave = p->wakee_flips;
>>>       int factor = __this_cpu_read(sd_llc_size);
>>>
>>> +    /* Scale factor to physical-core count to account for SMT interference. */
>>> +    if (sched_smt_active())
>>> +        factor = DIV_ROUND_UP(factor,
>>> +                cpumask_weight(cpu_smt_mask(smp_processor_id())));
>>> +
>>>       if (master < slave)
>>>           swap(master, slave);
>>>       if (slave < factor || master < slave * factor)
>>
>> I assume not a lot of people care since this needs:
>
> wake_affine machinery needs SD_WAKE_AFFINE. No?

Yes, SD_WAKE_AFFINE gates the potential call to wake_affine() and the
'sd = NULL' assignment, but that's not what forces a wakeup (WF_TTWU)
into the slow path (sched_balance_find_dst_cpu()), which IMHO is the
actual wide wakeup.
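
That dispatch at the end of select_task_rq_fair() looks roughly like
this (minimal sketch from a recent tree; details vary by version):

  if (unlikely(sd)) {
          /* Slow path: an sd_flag matched, do a full domain scan. */
          new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu);
  } else if (wake_flags & WF_TTWU) {
          /* Fast path: look for an idle CPU within the LLC. */
          new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
  }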

You need 'sd != NULL', which for a wakeup can only be set at (1):

  for_each_domain(cpu, tmp) {
          ...
          if (tmp->flags & sd_flag)  <-- (1) sd_flag == SD_BALANCE_WAKE for WF_TTWU
                  sd = tmp;
  }

and since SD_BALANCE_WAKE is never set by default in sd_init()
[kernel/sched/topology.c], I wonder how they achieved this wide (i.e.
not affine to the MC domain of this_cpu or prev_cpu) wakeup?
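
For reference, sd_init() sets the default flags roughly like this
(excerpt, paraphrased from kernel/sched/topology.c):

  *sd = (struct sched_domain){
          ...
          .flags = 1*SD_BALANCE_NEWIDLE
                 | 1*SD_BALANCE_EXEC
                 | 1*SD_BALANCE_FORK
                 | 0*SD_BALANCE_WAKE    /* <-- never set by default */
                 | 1*SD_WAKE_AFFINE,
          ...
  };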

By default, we only select wide for WF_FORK and WF_EXEC.

Or do they just want to force 'wake_wide(p) == 1' so that sis()
(select_idle_sibling()) runs with new_cpu = prev_cpu?
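
(With wake_wide(p) == 1, want_affine stays 0 and 'new_cpu' keeps its
'prev_cpu' initialization; paraphrased from select_task_rq_fair():

  int new_cpu = prev_cpu;
  ...
  if (wake_flags & WF_TTWU) {
          record_wakee(p);
          ...
          want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
  }

so the fast path then runs select_idle_sibling(p, prev_cpu, prev_cpu).)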

[...]