Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining

From: Aaron Lu

Date: Tue Sep 30 2025 - 05:27:45 EST


Hi Prateek,

On Tue, Sep 30, 2025 at 02:28:16PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 9/30/2025 1:26 PM, Aaron Lu wrote:
> > On Mon, Sep 29, 2025 at 03:04:03PM +0530, K Prateek Nayak wrote:
> > ... ...
> >> Can we instead do a check_enqueue_throttle() in enqueue_throttled_task()
> >> if we find cfs_rq->throttled_limbo_list to be empty?
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 18a30ae35441..fd2d4dad9c27 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -5872,6 +5872,8 @@ static bool enqueue_throttled_task(struct task_struct *p)
> >>  	 */
> >>  	if (throttled_hierarchy(cfs_rq) &&
> >>  	    !task_current_donor(rq_of(cfs_rq), p)) {
> > 		/*
> > 		 * Make sure to throttle this cfs_rq or it can be unthrottled
> > 		 * with no runtime_remaining and gets throttled again on its
> > 		 * unthrottle path.
> > 		 */
> >> +		if (list_empty(&cfs_rq->throttled_limbo_list))
> >> +			check_enqueue_throttle(cfs_rq);
> >
> > BTW, do you think a comment is needed? Something like the above, not
> > sure if it's too redundant though, feel free to let me know your
> > thoughts, thanks.
>
> Now that I'm looking at it again, I think we should actually do a:
>
> 	for_each_entity(se)
> 		check_enqueue_throttle(cfs_rq_of(se));

Nice catch and sigh.

>
> The reason being, we can have:
>
> root -> A (throttled) -> B -> C
>
> Consider B has runtime_remaining = 0, and subsequently a throttled task
> is queued onto C. Ideally, we should start the B/W timer for B at that
> point but we bail out after queuing it on C. Thoughts?
>

If we want to make sure that no cfs_rq with runtime_enabled gets
unthrottled with zero runtime_remaining, I agree we will have to do the
check hierarchically.
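
To make the root -> A (throttled) -> B -> C case concrete, here is a
userspace toy model, not kernel code. The toy_* names and the struct
layout are made up for illustration; only the field names mirror
cfs_rq, and the check is a heavily simplified stand-in for
check_enqueue_throttle().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's cfs_rq; not the real layout. */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;	/* NULL for the root */
	bool runtime_enabled;		/* a quota is set on this level */
	bool throttled;			/* this level itself is throttled */
	long long runtime_remaining;	/* local runtime, in ns */
};

/*
 * Simplified check: throttle a quota-enabled, not yet throttled level
 * that has no runtime left. The real function does more than this.
 */
void toy_check_enqueue_throttle(struct toy_cfs_rq *cfs_rq)
{
	if (!cfs_rq->runtime_enabled || cfs_rq->throttled)
		return;
	if (cfs_rq->runtime_remaining <= 0)
		cfs_rq->throttled = true;
}

/* Hierarchical variant: run the same check on every level up to the root. */
void toy_check_hierarchy(struct toy_cfs_rq *cfs_rq)
{
	assert(cfs_rq != NULL);
	for (; cfs_rq; cfs_rq = cfs_rq->parent)
		toy_check_enqueue_throttle(cfs_rq);
}
```

In this model, checking only C leaves B alone when B has
runtime_enabled and runtime_remaining = 0, while the walk from C
upwards throttles B and skips A, which is already throttled.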

I don't feel good about that for_each_entity(se) check_enqueue_throttle()
loop though; it feels like we would be duplicating enqueue_task_fair()
somehow...

That said, if we have to do that hierarchical check, I would prefer
to throttle it upfront in tg_set_cfs_bandwidth() :) The needless runtime
assignment is just 1ns and should only affect the first period, so it
shouldn't matter much?
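
A rough userspace sketch of what "throttle upfront" means here, again
a toy model and not the actual patch. All toy_* names are hypothetical,
and the pool is assumed to never go negative. When a quota is enabled,
the level tries to take 1ns from the group pool and is throttled right
away if it gets nothing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical group-wide runtime pool, in ns; assumed non-negative. */
struct toy_pool {
	long long runtime;
};

/* Hypothetical per-CPU level; field names mirror cfs_rq only loosely. */
struct toy_bw_rq {
	bool runtime_enabled;
	bool throttled;
	long long runtime_remaining;	/* local runtime, in ns */
};

/* Move up to `want` ns from the pool; true if local runtime is now > 0. */
bool toy_assign_runtime(struct toy_pool *pool, struct toy_bw_rq *rq,
			long long want)
{
	long long got = pool->runtime < want ? pool->runtime : want;

	pool->runtime -= got;
	rq->runtime_remaining += got;
	return rq->runtime_remaining > 0;
}

/* Enable the quota: ask for 1ns, and throttle now if none is available. */
void toy_enable_bandwidth(struct toy_pool *pool, struct toy_bw_rq *rq)
{
	rq->runtime_enabled = true;
	rq->runtime_remaining = 0;
	if (!toy_assign_runtime(pool, rq, 1))
		rq->throttled = true;
}
```

Either way, the level never ends up runtime_enabled, unthrottled, and
with zero runtime_remaining, so no later walk up the hierarchy is
needed in this model. The cost is the 1ns taken from the pool.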