Re: [RFC PATCH] sched/fair: Remove sched_idle_cpu() usages in select_task_rq_fair()
From: Josh Don
Date: Thu Aug 21 2025 - 14:14:13 EST
On Wed, Aug 20, 2025 at 6:53 PM Chengming Zhou <chengming.zhou@xxxxxxxxx> wrote:
>
> +cc Josh and Viresh, I forgot to cc you, sorry!
Thanks, missed this previously :)
>
> On 2025/8/20 21:53, Christian Loehle wrote:
> > On 8/19/25 16:32, Chen, Yu C wrote:
> >> On 8/18/2025 9:24 PM, Christian Loehle wrote:
> >>> On 8/18/25 13:47, Chengming Zhou wrote:
> >>>> These sched_idle_cpu() considerations in select_task_rq_fair() are based
> >>>> on an assumption that the wakee task can pick a cpu running sched_idle
> >>>> task and preempt it to run, faster than picking an idle cpu to preempt
> >>>> the idle task.
> >>>>
> >>>> This assumption is correct, but it also brings some problems:
> >>>>
> >>>> 1. work conservation: Often sched_idle tasks are also picking the cpu
> >>>> which is already running sched_idle task, instead of utilizing a real
> >>>> idle cpu, so work conservation is somewhat broken.
> >>>>
> >>>> 2. sched_idle group: This sched_idle_cpu() is just not correct with
> >>>> sched_idle group running. Look at the simple example below.
> >>>>
> >>>>                  root
> >>>>                /      \
> >>>>          kubepods     system
> >>>>          /     \
> >>>>   burstable   besteffort
> >>>>               (cpu.idle == 1)
Thanks for bringing attention to this scenario. It's a case I've
worried about but haven't had a good idea for fixing. Ideally we
could use find_matching_se(), but we want to do these checks locklessly
and quickly, so that's out of the question. Agreed that it's a hard
problem.
One idea is that we at least handle the (what I think is fairly
typical) scenario of a root-level sched_idle group well (a root-level
sched_idle group is trivially idle with respect to anything else in
the system that is not also nested under a root-level sched_idle
group). It would be fairly easy to track a nr_idle_queued cfs_rq
field, as well as to cache on task enqueue whether the task nests
under a root-level sched_idle group.
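To make the idea concrete, here is a minimal userspace sketch of the
bookkeeping. It is a model, not kernel code: struct group, struct task,
struct rq_model and every helper name below are hypothetical stand-ins,
and only nr_idle_queued comes from the suggestion above. The hierarchy
walk happens once at enqueue, so the wakeup-time check is reduced to
comparing two counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the scheduler structures. */
struct group {
	struct group *parent;	/* NULL for the root group */
	bool idle;		/* models cpu.idle == 1 */
};

struct task {
	struct group *grp;
	bool under_root_idle;	/* cached at enqueue time */
};

struct rq_model {
	unsigned int nr_queued;		/* all queued tasks */
	unsigned int nr_idle_queued;	/* tasks under a root-level idle group */
};

/* Find the ancestor directly below root and report its idle flag. */
static bool nests_under_root_idle(const struct group *g)
{
	if (!g || !g->parent)
		return false;	/* task sits directly in the root group */
	while (g->parent->parent)
		g = g->parent;
	return g->idle;
}

static void enqueue(struct rq_model *rq, struct task *t)
{
	/* Pay for the hierarchy walk once, here, not at wakeup time. */
	t->under_root_idle = nests_under_root_idle(t->grp);
	rq->nr_queued++;
	if (t->under_root_idle)
		rq->nr_idle_queued++;
}

static void dequeue(struct rq_model *rq, struct task *t)
{
	assert(rq->nr_queued > 0);
	rq->nr_queued--;
	/* Use the cached flag so the counters stay balanced. */
	if (t->under_root_idle) {
		assert(rq->nr_idle_queued > 0);
		rq->nr_idle_queued--;
	}
}

/* Cheap check: is everything queued here under a root-level idle group? */
static bool only_root_idle_queued(const struct rq_model *rq)
{
	return rq->nr_queued > 0 && rq->nr_idle_queued == rq->nr_queued;
}
```

Note that in this sketch a task in kubepods/besteffort from the example
above is not counted, since kubepods itself is not idle. That nested
case stays unhandled, which is the trade-off of covering only the
root-level scenario.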
Best,
Josh