Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
From: Hao Jia
Date: Tue Oct 14 2025 - 03:43:29 EST
Hello Aaron,
On 2025/9/29 15:46, Aaron Lu wrote:
> When a cfs_rq is to be throttled, its limbo list should be empty and
> that's why there is a warn in tg_throttle_down() for non empty
> cfs_rq->throttled_limbo_list.
> When running a test with the following hierarchy:
>           root
>         /      \
>         A*     ...
>      /  |  \   ...
>         B
>        /  \
>        C*
> where both A and C have quota settings, that warn on non empty limbo list
> is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu
> part of the cfs_rq for the sake of simpler representation).
I encountered a similar warning a while ago and fixed it, and I have a question. tg_unthrottle_up(cfs_rq_C) calls enqueue_task_fair(p) to enqueue a task, which requires that runtime_remaining be greater than 0 at every level of task p's task_group hierarchy.
In addition to the case you fixed above, is it possible that, while bandwidth control is otherwise running normally, there is a corner case where cfs_A->runtime_remaining > 0 but cfs_B->runtime_remaining < 0, which could trigger a similar warning?
I previously tried to fix this with the patch below, which adds an ENQUEUE_THROTTLE flag so that tasks enqueued from tg_unthrottle_up() skip check_enqueue_throttle() and are not throttled again.
---
kernel/sched/fair.c | 6 ++++--
kernel/sched/sched.h | 1 +
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df8dc389af8e..128efa2eba57 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5290,7 +5290,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	se->on_rq = 1;
 	if (cfs_rq->nr_queued == 1) {
-		check_enqueue_throttle(cfs_rq);
+		if (!(flags & ENQUEUE_THROTTLE))
+			check_enqueue_throttle(cfs_rq);
+
 		list_add_leaf_cfs_rq(cfs_rq);
 #ifdef CONFIG_CFS_BANDWIDTH
 		if (cfs_rq->pelt_clock_throttled) {
@@ -5905,7 +5907,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
 		list_del_init(&p->throttle_node);
 		p->throttled = false;
-		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
+		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
 	}
 	/* Add cfs_rq with load or one or more already running entities to the list */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b5367c514c14..871dfb761676 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2358,6 +2358,7 @@ extern const u32 sched_prio_to_wmult[40];
#define ENQUEUE_MIGRATING 0x100
#define ENQUEUE_DELAYED 0x200
#define ENQUEUE_RQ_SELECTED 0x400
+#define ENQUEUE_THROTTLE 0x800
#define RETRY_TASK ((void *)-1UL)
---
Unfortunately, I tried to build some tests locally but could not reproduce this corner case.
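To make the corner case I am asking about more concrete, here is a rough userspace sketch. It is not kernel code: struct toy_cfs_rq, toy_enqueue(), toy_corner_case() and TOY_ENQUEUE_THROTTLE are hypothetical names, the runtime values are made up, and I am assuming for illustration that B also has a quota. The model only mirrors the idea that the first entity queued at a level triggers a runtime check there, and that the flag from the patch above skips that check.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag mirroring ENQUEUE_THROTTLE from the patch. */
#define TOY_ENQUEUE_THROTTLE 0x1

/* Toy stand-in for a cfs_rq; NOT the kernel structure. */
struct toy_cfs_rq {
	const char *name;
	bool runtime_enabled;		/* a quota is configured at this level */
	long runtime_remaining;		/* <= 0 means no runtime left */
	int nr_queued;			/* entities queued at this level */
	struct toy_cfs_rq *parent;	/* NULL at the top of the hierarchy */
};

/*
 * Walk from the leaf towards the root, queueing one entity per level.
 * Returns the first level that would be throttled on enqueue, or NULL
 * if the enqueue gets through every level.
 */
static struct toy_cfs_rq *toy_enqueue(struct toy_cfs_rq *leaf, int flags)
{
	struct toy_cfs_rq *rq;

	for (rq = leaf; rq; rq = rq->parent) {
		rq->nr_queued++;

		/* Only the first entity at a level triggers the check. */
		if (rq->nr_queued != 1)
			break;	/* the group is already queued on its parent */

		/* With the flag set, the runtime check is skipped. */
		if (flags & TOY_ENQUEUE_THROTTLE)
			continue;

		if (rq->runtime_enabled && rq->runtime_remaining <= 0)
			return rq;
	}
	return NULL;
}

/* Build A -> B -> C with assumed runtime values and enqueue at C. */
static struct toy_cfs_rq *toy_corner_case(int flags)
{
	static struct toy_cfs_rq a, b, c;

	a = (struct toy_cfs_rq){ .name = "A", .runtime_enabled = true,
				 .runtime_remaining = 5 };
	b = (struct toy_cfs_rq){ .name = "B", .runtime_enabled = true,
				 .runtime_remaining = -1, .parent = &a };
	c = (struct toy_cfs_rq){ .name = "C", .runtime_enabled = true,
				 .runtime_remaining = 3, .parent = &b };

	return toy_enqueue(&c, flags);
}
```

In this model, toy_corner_case(0) stops at B even though A and C still have runtime, while toy_corner_case(TOY_ENQUEUE_THROTTLE) reaches the top without being throttled. Whether the real hierarchy can actually get into that state is exactly what I am unsure about.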
Thanks,
Hao