Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width

From: Zhang Qiao

Date: Tue Apr 28 2026 - 22:43:23 EST

Hi,

On 2026/4/22 21:26, Dietmar Eggemann wrote:
> On 16.04.26 09:41, Zhang Qiao wrote:
>> Hi Shrikanth,
>>
>> On 2026/4/8 1:58, Shrikanth Hegde wrote:
>>> Hi.
>>>
>>> On 4/7/26 12:09 PM, Zhang Qiao wrote:
>>>> wake_wide() uses sd_llc_size as the spreading threshold to detect wide
>>>> waker/wakee relationships and to disable wake_affine() for those cases.
>>>>
>>>> On SMT systems, sd_llc_size counts logical CPUs rather than physical
>>>> cores. This inflates the wake_wide() threshold, allowing wake_affine()
>>>> to pack more tasks into one LLC domain than the actual compute capacity
>>>> of its physical cores can sustain. The resulting SMT interference may
>>>> cost more than the cache-locality benefit wake_affine() intends to gain.
>>>>
>>>
>>> Isn't load balancing supposed to move it out? What does the workload do?
>>
>> The workload is a producer-consumer model: one producer wakes up ~50
>> different consumers, with roughly 10+ consumers running concurrently.
>> The total number of tasks is well below the CPU count.
>
> But higher than your MC core count I believe? Otherwise you wouldn't
> care. I assume you have MC CPU count of 12-24. Do you have more than 2
> different MCs?

My server has 10 different MCs (LLCs), with each MC containing 8 physical cores
(16 threads with SMT-2).

>
>> In this scenario, load balancing is largely ineffective. Each consumer
>> spends most of its time sleeping, gets woken by the producer, runs
>> briefly to process the message, then goes back to sleep. There is
>> almost no window where a consumer sits on a CPU runqueue in the runnable
>> state waiting to be pulled. Since load balancing can only migrate
>> runnable tasks, it simply has no target to act on here.
>
> OK, but SD_BALANCE_WAKE is not set by default, nobody would experience a

SD_BALANCE_WAKE was not enabled in my tests.

> difference in behaviour on an SMT machine in terms of waking tasks wide,
> i.e. going through the slow path. Like I tried to explain in the
> adjacent thread, your wakees would only end up in the slow path in case
> your sched domains would have SD_BALANCE_WAKE set.
>
> Or do you just want to force wakeups which have wake_wide(p) return 1
> always into the fast path with 'new_cpu == prev_cpu'? But this wouldn't
> be wake wide?

The observed improvement comes from suppressing wake_affine() before it
pulls wakees onto the waker's physical core. In the producer-consumer
workload, without this patch, consumers are repeatedly affined into the
waker's LLC and end up co-scheduled on the same physical core's SMT
siblings. With the patch, wake_wide() fires earlier and wakees are left
on prev_cpu, resulting in better spread across physical cores.
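
To make the threshold effect concrete, here is a minimal userspace sketch of
the check. It is a model, not the patch itself: the comparison follows the
mainline wake_wide() heuristic, but wake_wide_model(), scaled_factor(), the
idea of dividing the factor by the SMT width, and the wakee_flips values used
below are assumptions chosen only for illustration.

```c
/*
 * Userspace model of the wake_wide() heuristic (illustration only).
 * 'factor' stands in for sd_llc_size; the flip counts stand in for
 * the wakee_flips of the waker and the wakee.
 */
static int wake_wide_model(unsigned int waker_flips,
			   unsigned int wakee_flips,
			   unsigned int factor)
{
	unsigned int master = waker_flips;
	unsigned int slave = wakee_flips;

	/* Keep the larger flip count in 'master'. */
	if (master < slave) {
		unsigned int tmp = master;

		master = slave;
		slave = tmp;
	}

	/* Not wide: wake_affine() stays enabled. */
	if (slave < factor || master < slave * factor)
		return 0;

	/* Wide: wake_affine() is suppressed. */
	return 1;
}

/*
 * Hypothetical scaling: count physical cores instead of logical CPUs.
 * The real patch may compute this differently.
 */
static unsigned int scaled_factor(unsigned int llc_size,
				  unsigned int smt_width)
{
	unsigned int factor;

	if (smt_width == 0)
		smt_width = 1;

	factor = llc_size / smt_width;
	return factor ? factor : 1;
}
```

With 16 logical CPUs per LLC and made-up flip counts of 200 for the producer
and 10 for a consumer, wake_wide_model(200, 10, 16) returns 0, so
wake_affine() stays enabled. With the factor scaled to 8 physical cores,
wake_wide_model(200, 10, scaled_factor(16, 2)) returns 1 and the wakee is
left on prev_cpu.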

Thanks
Zhang Qiao
