Re: [RFC PATCH] sched/fair: scale wake_wide() threshold by SMT width

From: Dietmar Eggemann

Date: Mon May 11 2026 - 12:01:51 EST

On 29.04.26 04:43, Zhang Qiao wrote:
>
> Hi,
>
> On 2026/4/22 21:26, Dietmar Eggemann wrote:
>> On 16.04.26 09:41, Zhang Qiao wrote:
>>> Hi Shrikanth,
>>>
>>> On 2026/4/8 1:58, Shrikanth Hegde wrote:
>>>> Hi.
>>>>
>>>> On 4/7/26 12:09 PM, Zhang Qiao wrote:

[...]

>>> The workload is a producer-consumer model: one producer wakes up ~50
>>> different consumers, with roughly 10+ consumers running concurrently.
>>> The total number of tasks is well below the CPU count.
>>
>> But higher than your MC core count I believe? Otherwise you wouldn't
>> care. I assume you have MC CPU count of 12-24. Do you have more than 2
>> different MCs?
>
> My server has 10 different MCs (LLCs), with each MC containing 8 physical cores
> (16 threads with SMT-2).

Thanks.

>>> In this scenario, load balancing is largely ineffective. Each consumer
>>> spends most of its time sleeping, gets woken by the producer, runs
>>> briefly to process the message, then goes back to sleep. There is
>>> almost no window where a consumer sits on a CPU runqueue in the runnable
>>> state waiting to be pulled. Since load balancing can only migrate
>>> runnable tasks, it simply has no target to act on here.
>>
>> OK, but SD_BALANCE_WAKE is not set by default, nobody would experience a
>
> SD_BALANCE_WAKE was not enabled in my tests.

Right, looks like I mixed up balance flags & fast/slow path with the
wake affine vs. wake wide logic.

>> difference in behaviour on an SMT machine in terms of waking tasks wide,
>> i.e. going through the slow path. Like I tried to explain in the
>> adjacent thread, your wakees would only end up in the slow path in case
>> your sched domains would have SD_BALANCE_WAKE set.
>>
>> Or do you just want to force wakeups which have wake_wide(p) return 1
>> always into the fast path with 'new_cpu == prev_cpu'? But this wouldn't
>> be wake wide?
>
> The observed improvement comes from suppressing wake_affine() before it
> pulls wakees onto the waker's physical core. In the producer-consumer
> workload, without this patch, consumers are repeatedly affined into the
> waker's LLC and end up co-scheduled on the same physical core's SMT
> siblings. With the patch, wake_wide() fires earlier and wakees are left
> on prev_cpu, resulting in better spread across physical cores.

Makes sense.
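
To make the "fires earlier" part concrete, here is a rough userspace model of
the wake_wide() heuristic. The comparison logic follows what mainline does
with the two wakee_flips counters and sd_llc_size. The smt_width divisor is
only my assumption about how the RFC scales the threshold, and
wake_wide_model() and its parameters are made-up names for illustration; the
actual patch may do this differently.

```c
/*
 * Userspace model of the wake_wide() heuristic, for illustration only.
 *
 * waker_flips / wakee_flips stand in for current->wakee_flips and
 * p->wakee_flips, llc_size stands in for sd_llc_size.
 *
 * smt_width is HYPOTHETICAL: it models "scale the threshold by SMT
 * width" as a plain division and is not taken from the RFC patch.
 */
int wake_wide_model(unsigned int waker_flips, unsigned int wakee_flips,
		    unsigned int llc_size, unsigned int smt_width)
{
	unsigned int master = waker_flips;
	unsigned int slave = wakee_flips;
	unsigned int factor = llc_size;

	/* Assumed scaling: count cores rather than SMT threads. */
	if (smt_width > 1)
		factor /= smt_width;
	if (factor == 0)
		factor = 1;

	/* The task with more flips is treated as the master. */
	if (master < slave) {
		unsigned int tmp = master;

		master = slave;
		slave = tmp;
	}

	/* Too few flips on either side: stay on the affine path. */
	if (slave < factor || master < slave * factor)
		return 0;

	return 1;
}
```

With llc_size = 16 (16 threads per LLC, as on your machine) and made-up flip
counts of 200 and 10, this returns 0 for smt_width = 1 and 1 for
smt_width = 2, i.e. the smaller factor lets the wide decision trigger at
lower flip counts.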

You mentioned having ~10+ consumers running concurrently. I’m curious
why select_idle_sibling() isn’t doing a better job of distributing those
tasks across idle cores, even though wakeups are affine to the waker and
its LLC domain. Is this because you only have 8 cores per LLC, combined
with general system noise?