Re: [RFC PATCH] sched/fair: Remove sched_idle_cpu() usages in select_task_rq_fair()

From: Josh Don
Date: Thu Aug 21 2025 - 14:14:13 EST

In reply to: Chengming Zhou: "Re: [RFC PATCH] sched/fair: Remove sched_idle_cpu() usages in select_task_rq_fair()"

On Wed, Aug 20, 2025 at 6:53 PM Chengming Zhou <chengming.zhou@xxxxxxxxx> wrote:
>
> +cc Josh and Viresh, I forgot to cc you, sorry!

Thanks, missed this previously :)

>
> On 2025/8/20 21:53, Christian Loehle wrote:
> > On 8/19/25 16:32, Chen, Yu C wrote:
> >> On 8/18/2025 9:24 PM, Christian Loehle wrote:
> >>> On 8/18/25 13:47, Chengming Zhou wrote:
> >>>> These sched_idle_cpu() considerations in select_task_rq_fair() are based
> >>>> on an assumption that the wakee task can pick a cpu running sched_idle
> >>>> task and preempt it to run, faster than picking an idle cpu to preempt
> >>>> the idle task.
> >>>>
> >>>> This assumption is correct, but it also brings some problems:
> >>>>
> >>>> 1. work conservation: Often sched_idle tasks are also picking the cpu
> >>>> which is already running sched_idle task, instead of utilizing a real
> >>>> idle cpu, so work conservation is somewhat broken.
> >>>>
> >>>> 2. sched_idle group: This sched_idle_cpu() is just not correct with
> >>>> sched_idle group running. Look at a simple example below.
> >>>>
> >>>>            root
> >>>>          /      \
> >>>>     kubepods    system
> >>>>      /     \
> >>>> burstable  besteffort
> >>>>            (cpu.idle == 1)

Thanks for bringing attention to this scenario; it's a case I've
worried about but haven't had a good idea for fixing. Ideally we
could use find_matching_se(), but we want to do these checks
locklessly and quickly, so that's out of the question. Agreed that
it's a hard problem.
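
To make the comparison concrete, here is a minimal userspace sketch of
what a find_matching_se()-style walk decides for the hierarchy quoted
above. It is only a model, not kernel code: struct model_se, the
model_* helpers and the static tree are hypothetical names made up for
illustration, and locking, which is the real cost, is ignored entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a scheduling entity. */
struct model_se {
	const char *name;
	struct model_se *parent;	/* NULL for entities queued at the root */
	int depth;			/* 0 for root-level entities */
	bool idle;			/* group has cpu.idle == 1 */
};

/* The hierarchy from the quoted example, plus one task per leaf group. */
static struct model_se kubepods   = { "kubepods",   NULL,        0, false };
static struct model_se system_grp = { "system",     NULL,        0, false };
static struct model_se burstable  = { "burstable",  &kubepods,   1, false };
static struct model_se besteffort = { "besteffort", &kubepods,   1, true  };
static struct model_se task_bu    = { "task-bu",    &burstable,  2, false };
static struct model_se task_be    = { "task-be",    &besteffort, 2, false };
static struct model_se task_sys   = { "task-sys",   &system_grp, 1, false };

/* Walk both entities up until they are siblings under the same parent. */
static void model_find_matching_se(struct model_se **se, struct model_se **pse)
{
	while ((*se)->depth > (*pse)->depth)
		*se = (*se)->parent;
	while ((*pse)->depth > (*se)->depth)
		*pse = (*pse)->parent;
	while ((*se)->parent != (*pse)->parent) {
		*se = (*se)->parent;
		*pse = (*pse)->parent;
	}
}

/* True if curr is idle relative to wakee at their common level. */
static bool model_curr_is_idle_relative_to(struct model_se *curr,
					   struct model_se *wakee)
{
	model_find_matching_se(&curr, &wakee);
	return curr->idle && !wakee->idle;
}
```

In this model a task in besteffort is idle relative to a waking task in
burstable (they meet at besteffort vs. burstable), but not relative to
a waking task in system (they meet at kubepods vs. system, and kubepods
is not idle).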

One idea is to at least handle well the scenario of a root-level
sched_idle group, which I think is fairly typical. A root-level
sched_idle group is trivially idle with respect to anything else in
the system that is not also nested under a root-level sched_idle
group. It would be fairly easy to track an nr_idle_queued field in
cfs_rq, and to cache at task enqueue whether the task nests under a
sched_idle group.
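
A rough userspace sketch of that bookkeeping, under stated
assumptions: only nr_idle_queued comes from the idea above; struct
model_task, struct model_cfs_rq, nr_queued, the cached flag and the
model_* helpers are hypothetical names for illustration. It does not
address how the cached flag would be refreshed if a group's cpu.idle
setting changes while tasks are queued.

```c
#include <stdbool.h>

/* Hypothetical per-task state: the cached answer from enqueue time. */
struct model_task {
	bool under_root_idle_group;
};

/* Hypothetical root cfs_rq counters; only nr_idle_queued is from the mail. */
struct model_cfs_rq {
	unsigned int nr_queued;		/* all queued tasks */
	unsigned int nr_idle_queued;	/* those under a root-level idle group */
};

/* Cache the nesting answer once at enqueue, then count the task. */
static void model_enqueue(struct model_cfs_rq *rq, struct model_task *p,
			  bool root_ancestor_idle)
{
	p->under_root_idle_group = root_ancestor_idle;
	rq->nr_queued++;
	if (p->under_root_idle_group)
		rq->nr_idle_queued++;
}

/* Undo the accounting using the cached flag, without walking the hierarchy. */
static void model_dequeue(struct model_cfs_rq *rq, const struct model_task *p)
{
	rq->nr_queued--;
	if (p->under_root_idle_group)
		rq->nr_idle_queued--;
}

/*
 * The CPU counts as sched-idle for a wakee only if the wakee is not itself
 * under a root-level idle group and everything queued here is.
 */
static bool model_cpu_is_sched_idle_for(const struct model_cfs_rq *rq,
					bool wakee_under_root_idle_group)
{
	if (wakee_under_root_idle_group)
		return false;
	return rq->nr_queued > 0 && rq->nr_idle_queued == rq->nr_queued;
}
```

The check reads two counters and needs no hierarchy walk, which is
what would keep it usable from a lockless wakeup path. An empty
runqueue deliberately returns false here, since that is a truly idle
CPU rather than a sched-idle one.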

Best,
Josh
