Re: [RFC PATCH 3/7] sched/fair: Handle unthrottle path for task based throttle

From: K Prateek Nayak
Date: Fri Mar 14 2025 - 13:52:39 EST


Hello Aaron,

On 3/14/2025 4:13 PM, Aaron Lu wrote:
On Fri, Mar 14, 2025 at 09:23:47AM +0530, K Prateek Nayak wrote:
Hello Aaron,

On 3/13/2025 12:51 PM, Aaron Lu wrote:

[..snip..]

---
kernel/sched/fair.c | 132 +++++++++++++++-----------------------------
1 file changed, 45 insertions(+), 87 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab403ff7d53c8..4a95fe3785e43 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5366,18 +5366,17 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct
sched_entity *se, int flags)

if (cfs_rq->nr_queued == 1) {
check_enqueue_throttle(cfs_rq);
- if (!throttled_hierarchy(cfs_rq)) {
- list_add_leaf_cfs_rq(cfs_rq);
- } else {
+ list_add_leaf_cfs_rq(cfs_rq);
#ifdef CONFIG_CFS_BANDWIDTH
+ if (throttled_hierarchy(cfs_rq)) {
struct rq *rq = rq_of(cfs_rq);

if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
cfs_rq->throttled_clock = rq_clock(rq);
if (!cfs_rq->throttled_clock_self)
cfs_rq->throttled_clock_self = rq_clock(rq);

These bits probabaly need revisiting. From what I understand, these
stats were maintained to know when a task was woken up on a
throttled hierarchy which was not connected to the parent essentially
tracking the amount of time runnable tasks were waiting on the
cfs_rq before an unthrottle event allowed them to be picked.

Do you mean these throttled_clock stats?

I think they are here because we do not record the throttled_clock for
empty cfs_rqs and once the cfs_rq has task enqueued, it needs to record
its throttled_clock. This is done in commit 79462e8c879a("sched: don't
account throttle time for empty groups") by Josh. I don't think per-task
throttle change this.

With this said, I think there is a gap in per-task throttle, i.e. when
all tasks are dequeued from this throttled cfs_rq, we should record its
throttled_time and clear its throttled_clock.

Yes but then what it'll track is the amount of time task were running
when the cfs_rq was on a throttled hierarchy. Is that what we want to
track or something else.

The commit log for 677ea015f231 ("sched: add throttled time stat for
throttled children") says the following for "throttled_clock_self_time":

We currently export the total throttled time for cgroups that are given
a bandwidth limit. This patch extends this accounting to also account
the total time that each children cgroup has been throttled.

This is useful to understand the degree to which children have been
affected by the throttling control. Children which are not runnable
during the entire throttled period, for example, will not show any
self-throttling time during this period.

but with per-task model, it is probably the amount of time that
"throttled_limbo_list" has a task on it since they are runnable
but are in-fact waiting for an unthrottle.

If a task enqueues itself on a throttled hierarchy and then blocks
again before exiting to the userspace, it should not count in
"throttled_clock_self_time" since the task was runnable the whole
time despite the hierarchy being frozen.



With per-task throttle, these semantics no longer apply since a woken
task will run and dequeue itself when exiting to userspace.

Josh do you have any thoughts on this?

-#endif
}
+#endif
}
}


@@ -5947,12 +5967,16 @@ static int tg_throttle_down(struct task_group
*tg, void *data)

/* group is entering throttled state, stop time */
cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
- list_del_leaf_cfs_rq(cfs_rq);

SCHED_WARN_ON(cfs_rq->throttled_clock_self);
if (cfs_rq->nr_queued)
cfs_rq->throttled_clock_self = rq_clock(rq);

+ if (!cfs_rq->nr_queued) {
+ list_del_leaf_cfs_rq(cfs_rq);
+ return 0;
+ }
+

This bit can perhaps go in Patch 2?

I kept all the changes to leaf cfs_rq handling in one patch, I think it
is easier to review :-)

Thank you! That would be great.


Thanks,
Aaron

WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
/*
* rq_lock is held, current is (obviously) executing this in kernelspace.

--
Thanks and Regards,
Prateek