Re: [RFC PATCH] sched/fair: Remove sched_idle_cpu() usages in select_task_rq_fair()
From: Christian Loehle
Date: Wed Aug 20 2025 - 09:57:48 EST
On 8/19/25 16:32, Chen, Yu C wrote:
> On 8/18/2025 9:24 PM, Christian Loehle wrote:
>> On 8/18/25 13:47, Chengming Zhou wrote:
>>> These sched_idle_cpu() considerations in select_task_rq_fair() are based
>>> on an assumption that the wakee task can pick a cpu running sched_idle
>>> task and preempt it to run, faster than picking an idle cpu to preempt
>>> the idle task.
>>>
>>> This assumption is correct, but it also brings some problems:
>>>
>>> 1. work conservation: Often sched_idle tasks are also picking the cpu
>>> which is already running sched_idle task, instead of utilizing a real
>>> idle cpu, so work conservation is somewhat broken.
>>>
>>> 2. sched_idle group: This sched_idle_cpu() is just not correct with
>>> sched_idle group running. Look at the simple example below.
>>>
>>>                root
>>>               /    \
>>>        kubepods    system
>>>        /      \
>>> burstable    besteffort
>>>              (cpu.idle == 1)
>>>
>>> When a sched_idle cpu is just running tasks from besteffort group,
>>> sched_idle_cpu() will return true in this case, but this cpu pick
>>> is bad for wakee task from system group. Because the system group
>>> has lower weight than kubepods, work conservation is somewhat
>>> broken too.
>>>
>>> In a nutshell, sched_idle_cpu() should consider the wakee task group's
>>> relationship with sched_idle tasks running on the cpu.
>>>
>>> Obviously, it's hard to do so. This patch chooses the simple approach
>>> to remove all sched_idle_cpu() considerations in select_task_rq_fair()
>>> to bring back work conservation in these cases.
>>
>> OTOH sched_idle_cpu() CPUs are guaranteed to not be in an idle state and
>> potentially already have DVFS on some higher level...
>>
> Is it because the schedutil governor considers the utilization
> of SCHED_IDLE, thus causing schedutil to request a higher
> frequency?
For intel_pstate active (HWP and !HWP) the same issue should persist, no?
>
> The commit 3c29e651e16d ("sched/fair: Fall back to sched-idle
> CPU if an idle CPU isn't found") mentions that choosing a CPU
> running a SCHED_IDLE task can avoid waking a CPU from a deep
> sleep state.
>
> If this is the case, can we say that if an administrator sets
> the cpufreq governor to "performance" and disables deep idle
> states, an idle CPU would be more preferable than a CPU running
> a SCHED_IDLE task? On the other hand, if
> per_cpu(cpufreq_update_util_data, cpu) is NULL and only shallow
> idle states are enabled in idle_get_state(), should we skip
> SCHED_IDLE to achieve work conservation?
That's probably getting the most out of it.
That being said, strictly speaking the SCHED_IDLE CPU and the
SHALLOW_IDLE CPU may still share a power and thermal budget, which
may make preempting the sched-idle task on SCHED_IDLE CPU the
better choice.
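
To make the heuristic under discussion concrete, here is a rough userspace sketch in C. It is not kernel code: `cpu_snapshot`, `has_cpufreq_hook`, `shallow_idle_only` and `pick_wakeup_cpu()` are all made-up names, with the two flags standing in for the per_cpu(cpufreq_update_util_data, cpu) and idle_get_state() checks mentioned above. The ordering (cheap idle CPU first, then a sched-idle CPU, then any other idle CPU) is an assumption for illustration, and the shared power/thermal budget caveat is deliberately not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of a candidate CPU; not a kernel type. */
enum cpu_kind { CPU_BUSY, CPU_IDLE, CPU_SCHED_IDLE };

/* Hypothetical per-CPU snapshot used only by this sketch. */
struct cpu_snapshot {
	enum cpu_kind kind;
	/* Stand-in for per_cpu(cpufreq_update_util_data, cpu) != NULL. */
	bool has_cpufreq_hook;
	/* Stand-in for "only shallow idle states are enabled". */
	bool shallow_idle_only;
};

/*
 * An idle CPU is treated as "cheap" when there is no frequency ramp to
 * wait for and no deep idle state to exit (assumption of this sketch).
 */
static bool idle_cpu_is_cheap(const struct cpu_snapshot *c)
{
	return c->kind == CPU_IDLE && !c->has_cpufreq_hook &&
	       c->shallow_idle_only;
}

/*
 * Return the index of the preferred CPU, or -1 if every CPU is busy.
 * Preference order: cheap idle CPU, then sched-idle CPU, then any
 * other idle CPU. Within a class the lowest index wins.
 */
static int pick_wakeup_cpu(const struct cpu_snapshot *cpus, size_t nr)
{
	int cheap_idle = -1, sched_idle = -1, other_idle = -1;

	for (size_t i = 0; i < nr; i++) {
		const struct cpu_snapshot *c = &cpus[i];

		if (idle_cpu_is_cheap(c)) {
			if (cheap_idle < 0)
				cheap_idle = (int)i;
		} else if (c->kind == CPU_SCHED_IDLE) {
			if (sched_idle < 0)
				sched_idle = (int)i;
		} else if (c->kind == CPU_IDLE) {
			if (other_idle < 0)
				other_idle = (int)i;
		}
	}

	if (cheap_idle >= 0)
		return cheap_idle;
	if (sched_idle >= 0)
		return sched_idle;
	return other_idle;
}
```

With a cpufreq hook present or deep idle states enabled, this keeps the fallback of preempting the sched-idle task; only in the "performance governor, shallow idle only" style of setup does the idle CPU win.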