Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model

From: Chen, Yu C
Date: Sun Aug 17 2025 - 23:11:14 EST


On 8/18/2025 10:50 AM, Aaron Lu wrote:
On Sun, Aug 17, 2025 at 04:50:50PM +0800, Chen, Yu C wrote:
On 7/15/2025 3:16 PM, Aaron Lu wrote:
From: Valentin Schneider <vschneid@xxxxxxxxxx>

In current throttle model, when a cfs_rq is throttled, its entity will
be dequeued from cpu's rq, making tasks attached to it not able to run,
thus achiveing the throttle target.

This has a drawback though: assume a task is a reader of percpu_rwsem
and is waiting. When it gets woken, it can not run till its task group's
next period comes, which can be a relatively long time. Waiting writer
will have to wait longer due to this and it also makes further reader
build up and eventually trigger task hung.

To improve this situation, change the throttle model to task based, i.e.
when a cfs_rq is throttled, record its throttled status but do not remove
it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
they get picked, add a task work to them so that when they return
to user, they can be dequeued there. In this way, tasks throttled will
not hold any kernel resources. And on unthrottle, enqueue back those
tasks so they can continue to run.

Throttled cfs_rq's PELT clock is handled differently now: previously the
cfs_rq's PELT clock is stopped once it entered throttled state but since
now tasks(in kernel mode) can continue to run, change the behaviour to
stop PELT clock only when the throttled cfs_rq has no tasks left.

Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
Suggested-by: Chengming Zhou <chengming.zhou@xxxxxxxxx> # tag on pick
Signed-off-by: Valentin Schneider <vschneid@xxxxxxxxxx>
Signed-off-by: Aaron Lu <ziqianlu@xxxxxxxxxxxxx>
---

[snip]


@@ -8813,19 +8815,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
+ struct task_struct *p;
+ bool throttled;
again:
cfs_rq = &rq->cfs;
if (!cfs_rq->nr_queued)
return NULL;
+ throttled = false;
+
do {
/* Might not have done put_prev_entity() */
if (cfs_rq->curr && cfs_rq->curr->on_rq)
update_curr(cfs_rq);
- if (unlikely(check_cfs_rq_runtime(cfs_rq)))
- goto again;
+ throttled |= check_cfs_rq_runtime(cfs_rq);
se = pick_next_entity(rq, cfs_rq);
if (!se)
@@ -8833,7 +8838,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
- return task_of(se);
+ p = task_of(se);
+ if (unlikely(throttled))
+ task_throttle_setup_work(p);
+ return p;
}

Previously, I was wondering if the above change might impact
wakeup latency in some corner cases: If there are many tasks
enqueued on a throttled cfs_rq, the above pick-up mechanism
might return an invalid p repeatedly (where p is dequeued,

By invalid, do you mean task that is in a throttled hierarchy?


Yes.

and a reschedule is triggered in throttle_cfs_rq_work() to
pick the next p; then the new p is found again on a throttled
cfs_rq). Before the above change, the entire cfs_rq's corresponding
sched_entity was dequeued in throttle_cfs_rq(): se = cfs_rq->tg->se(cpu)


Yes this is true and it sounds inefficient, but these newly woken tasks
may hold some kernel resources like a reader lock so we really want them
to finish their kernel jobs and release that resource before being
throttled or it can block/impact other tasks and even cause the whole
system to hung.


I see. Always dequeue each task during their ret2user phase would be safer.

thanks,
Chenyu
So I did some tests for this scenario on a Xeon with 6 NUMA nodes and
384 CPUs. I created 10 levels of cgroups and ran schbench on the leaf
cgroup. The results show that there is not much impact in terms of
wakeup latency (considering the standard deviation). Based on the data
and my understanding, for this series,

Tested-by: Chen Yu <yu.c.chen@xxxxxxxxx>

Good to know this and thanks a lot for the test!