Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width

From: Zhang Qiao

Date: Thu Apr 16 2026 - 03:44:21 EST


Hi Shrikanth,

On 2026/4/8 1:58, Shrikanth Hegde wrote:
> Hi.
>
> On 4/7/26 12:09 PM, Zhang Qiao wrote:
>> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
>> waker/wakee relationships and to disable wake_affine() for those cases.
>>
>> On SMT systems, sd_llc_size counts logical CPUs rather than physical
>> cores. This inflates the wake_wide() threshold, allowing wake_affine()
>> to pack more tasks into one LLC domain than the actual compute capacity
>> of its physical cores can sustain. The resulting SMT interference may
>> cost more than the cache-locality benefit wake_affine() intends to gain.
>>
>
> Isn't load balance to move it out? What does the workload do?

The workload is a producer-consumer model: one producer wakes up ~50
different consumers, with roughly 10+ consumers running concurrently.
The total number of tasks is well below the CPU count.

In this scenario, load balancing is largely ineffective. Each consumer
spends most of its time sleeping, gets woken by the producer, runs
briefly to process the message, then goes back to sleep. There is
almost no window where a consumer sits on a CPU runqueue in the runnable
state waiting to be pulled. Since load balancing can only migrate
runnable tasks, it simply has no target to act on here.
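
For reference, the wakeup pattern looks roughly like the userspace sketch
below. This is not the actual workload: run_fanout(), consumer_fn() and
struct consumer are hypothetical names, the per-message work is reduced to
a counter increment, and the producer is not paced, so it only illustrates
the one-to-many fan-out where each wakeup targets a different consumer
than the previous one.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical sketch of the fan-out; sizes and names are illustrative. */
struct consumer {
	pthread_t tid;
	sem_t wake;		/* posted by the producer, once per message */
	int quota;		/* number of messages this consumer will get */
	atomic_int *handled;	/* shared count of processed messages */
};

static void *consumer_fn(void *arg)
{
	struct consumer *c = arg;
	int i;

	for (i = 0; i < c->quota; i++) {
		sem_wait(&c->wake);			/* sleep until woken */
		atomic_fetch_add(c->handled, 1);	/* brief "processing" */
	}
	return NULL;
}

/* One producer (the caller) wakes nconsumers threads round-robin. */
static int run_fanout(int nconsumers, int nmsgs)
{
	struct consumer *c = calloc(nconsumers, sizeof(*c));
	atomic_int handled = 0;
	int i;

	if (!c)
		return -1;

	for (i = 0; i < nconsumers; i++) {
		/* Round-robin delivery gives the first (nmsgs % n) one extra. */
		c[i].quota = nmsgs / nconsumers + (i < nmsgs % nconsumers);
		c[i].handled = &handled;
		if (sem_init(&c[i].wake, 0, 0) ||
		    pthread_create(&c[i].tid, NULL, consumer_fn, &c[i]))
			abort();
	}

	/* Each message wakes a different consumer than the previous one. */
	for (i = 0; i < nmsgs; i++)
		sem_post(&c[i % nconsumers].wake);

	for (i = 0; i < nconsumers; i++) {
		pthread_join(c[i].tid, NULL);
		sem_destroy(&c[i].wake);
	}
	free(c);
	return atomic_load(&handled);
}
```

Calling run_fanout(50, 1000) returns once all 1000 messages have been
handled. A closer reproducer would pace the producer so that consumers
actually go back to sleep between messages.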

>
>> Scale the factor by the SMT width of the current CPU so that it
>> approximates the number of independent physical cores in the LLC domain,
>> making wake_wide() more likely to kick in before SMT interference
>> becomes significant. On non-SMT systems the SMT width is 1 and behaviour
>> is unchanged.
>>
>
> There are systems where LLC_SIZE == SMT_SIZE. i.e one core in the LLC.
> This would effectively disable wake_affine feature in such systems.
>
> Power10 being a major example.
>
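
A small userspace model of the check makes this case easy to see. It is
only a sketch: wake_wide_model() is a hypothetical helper that mirrors the
condition in the hunk quoted below, the flip counts are passed in as plain
numbers instead of being read from task_struct, and the LLC/SMT sizes in
the usage note are illustrative rather than measured.

```c
#include <stdbool.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Userspace model of the wake_wide() condition with the proposed scaling.
 * Not kernel code: llc_size and smt_width stand in for sd_llc_size and
 * the weight of cpu_smt_mask(); smt_width == 1 models the unscaled case.
 */
static bool wake_wide_model(unsigned int master, unsigned int slave,
			    unsigned int llc_size, unsigned int smt_width)
{
	unsigned int factor = DIV_ROUND_UP(llc_size, smt_width);
	unsigned int tmp;

	if (master < slave) {
		tmp = master;
		master = slave;
		slave = tmp;
	}
	/* Not wide: wake_affine() stays in play. */
	if (slave < factor || master < slave * factor)
		return false;
	/* Wide: wake_affine() is skipped. */
	return true;
}
```

With llc_size == smt_width (for example 8 and 8), factor becomes 1, and
the model returns true whenever both flip counts are non-zero. For a
32-CPU LLC with SMT2, factor drops from 32 to 16, so flip counts of
400 and 20 switch from "not wide" to "wide".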
>> Signed-off-by: Zhang Qiao <zhangqiao22@xxxxxxxxxx>
>> ---
>>   kernel/sched/fair.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f07df8987a5ef..4896582c6e904 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7334,6 +7334,11 @@ static int wake_wide(struct task_struct *p)
>>       unsigned int slave = p->wakee_flips;
>>       int factor = __this_cpu_read(sd_llc_size);
>> +    /* Scale factor to physical-core count to account for SMT interference. */
>> +    if (sched_smt_active())
>> +        factor = DIV_ROUND_UP(factor,
>> +                cpumask_weight(cpu_smt_mask(smp_processor_id())));
>> +
>>       if (master < slave)
>>           swap(master, slave);
>>       if (slave < factor || master < slave * factor)