Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle

From: K Prateek Nayak
Date: Fri Mar 14 2025 - 05:01:26 EST


Hello Aaron,

On 3/14/2025 2:18 PM, Aaron Lu wrote:
 static int tg_throttle_down(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+	struct task_struct *p;
+	struct rb_node *node;
+
+	cfs_rq->throttle_count++;
+	if (cfs_rq->throttle_count > 1)
+		return 0;

 	/* group is entering throttled state, stop time */
-	if (!cfs_rq->throttle_count) {
-		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
-		list_del_leaf_cfs_rq(cfs_rq);
+	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);

One concern here is that the PELT clock is seemingly frozen despite the
hierarchy still being runnable. I've not yet tracked down whether it'll
cause any problems once the hierarchy is unthrottled and all the
throttled time is negated from the pelt clock, but are there any
concerns here?

I chose to do it this way because:

1. I expect that most of the time, if a task has to continue to run
after its cfs_rq gets throttled, the remaining time is relatively small
and should not cause much impact. But I agree there can be times when a
task runs relatively long;
2. I think the original intent of freezing a cfs_rq's pelt clock on
throttle is so that on unthrottle, it can restore its load (without its
load being decayed etc.). If I chose not to freeze its pelt clock on
throttle because some task is still running in kernel mode, then since
some of this cfs_rq's tasks are throttled, its load can become smaller
and this can impact its load on unthrottle.

I think neither approach is perfect, so I chose the simple one for now
:) Not sure if my thinking is correct though.
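
(For context, the "negate the throttled time" part works roughly like
this on the unthrottle side; a trimmed paraphrase of mainline
kernel/sched/fair.c and kernel/sched/pelt.h, not the exact code of this
series:)

static int tg_unthrottle_up(struct task_group *tg, void *data)
{
	struct rq *rq = data;
	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];

	cfs_rq->throttle_count--;
	if (!cfs_rq->throttle_count)
		/* accumulate the time the pelt clock stood still */
		cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
					cfs_rq->throttled_clock_pelt;

	return 0;
}

/* ... which is then subtracted whenever the cfs_rq's pelt clock is read */
static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
	if (unlikely(cfs_rq->throttle_count))
		return cfs_rq->throttled_clock_pelt - cfs_rq->throttled_clock_pelt_time;

	return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
}

So load that sat still while the clock was frozen appears not to have
decayed at all once the cfs_rq is unthrottled.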

Maybe this can instead be done at dequeue time, when cfs_rq->nr_queued
on a throttled_hierarchy() reaches 0.

Yes, this looks more consistent; maybe I should change to this approach.

I agree the time might be small in most cases, but some syscalls, with
enough contention in the system, can take a while to return to user
mode. I'm not sure either what the correct approach is here - should a
subtree's PELT be frozen when its last task dequeues, or should we
freeze it for the whole hierarchy once the throttled cfs_rq dequeues?

I'll wait for other folks to chime in since they know these bits
better than me.
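
Something like this rough sketch is what I had in mind with the
dequeue-time freeze above (the hook point and placement are guesses on
my part, not code from this series):

/*
 * Hypothetical: freeze the pelt clock of a throttled cfs_rq only once
 * its last task has dequeued, instead of at throttle time.
 */
static void dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	/* ... existing dequeue work ... */

	if (!cfs_rq->nr_queued && throttled_hierarchy(cfs_rq)) {
		/* nothing runnable remains: stop the pelt clock now */
		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq_of(cfs_rq));
		list_del_leaf_cfs_rq(cfs_rq);
	}
}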


+	list_del_leaf_cfs_rq(cfs_rq);

-		SCHED_WARN_ON(cfs_rq->throttled_clock_self);
-		if (cfs_rq->nr_queued)
-			cfs_rq->throttled_clock_self = rq_clock(rq);
+	SCHED_WARN_ON(cfs_rq->throttled_clock_self);
+	if (cfs_rq->nr_queued)
+		cfs_rq->throttled_clock_self = rq_clock(rq);
+
+	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
+	/*
+	 * rq_lock is held, current is (obviously) executing this in kernelspace.
+	 *
+	 * All other tasks enqueued on this rq have their saved PC at the
+	 * context switch, so they will go through the kernel before returning
+	 * to userspace. Thus, there are no tasks-in-userspace to handle, just
+	 * install the task_work on all of them.
+	 */
+	node = rb_first(&cfs_rq->tasks_timeline.rb_root);
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		if (!entity_is_task(se))
+			goto next;
+
+		p = task_of(se);
+		task_throttle_setup_work(p);
+next:
+		node = rb_next(node);
+	}
+
+	/* curr is not in the timeline tree */
+	if (cfs_rq->curr && entity_is_task(cfs_rq->curr)) {

I believe we can reach here from pick_next_task_fair() ->
check_cfs_rq_runtime() -> throttle_cfs_rq(), in which case cfs_rq->curr
will still be set despite the task being blocked, since
put_prev_entity() has not been called yet.

I believe there should be a check for task_on_rq_queued() here for the
current task.
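
Something along these lines, perhaps (an untested sketch of the
suggested check, not code from the series):

	/* curr is not in the timeline tree */
	if (cfs_rq->curr && entity_is_task(cfs_rq->curr)) {
		p = task_of(cfs_rq->curr);
		/*
		 * curr can be a blocked task here when we enter via
		 * pick_next_task_fair() -> check_cfs_rq_runtime()
		 * before put_prev_entity() has run; only install the
		 * throttle work for a task that is actually queued.
		 */
		if (task_on_rq_queued(p))
			task_throttle_setup_work(p);
	}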

Ah right, I'll see how to fix this.

It may not be necessary with the recent suggestion from Chengming,
where you can just add the task work if the task was picked on a
throttled hierarchy.
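
Roughly like so (a hypothetical helper; the name and the assumption
that it gets called at the end of the fair pick path are mine):

/*
 * Hypothetical: invoked after a task has been picked, e.g. at the end
 * of pick_next_task_fair(). Queue the throttle work only if the task
 * was picked on a throttled hierarchy.
 */
static void task_throttle_check(struct task_struct *p)
{
	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);

	if (throttled_hierarchy(cfs_rq))
		task_throttle_setup_work(p);
}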


+		p = task_of(cfs_rq->curr);
+		task_throttle_setup_work(p);
 	}
-	cfs_rq->throttle_count++;

 	return 0;
 }
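
(For readers following along: the task work above uses the kernel's
generic task_work mechanism. A minimal sketch of the idea, assuming the
sched_throttle_work callback_head this series adds to task_struct:)

#include <linux/task_work.h>

/*
 * Sketch only: each task that must be throttled gets a piece of
 * task_work that runs on its way back to userspace, where the task
 * dequeues itself onto its cfs_rq's throttled_limbo_list.
 */
static void throttle_cfs_rq_work(struct callback_head *work)
{
	struct task_struct *p = container_of(work, struct task_struct,
					     sched_throttle_work);

	/* dequeue p and park it on cfs_rq->throttled_limbo_list ... */
}

static void task_throttle_setup_work(struct task_struct *p)
{
	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
	/* TWA_RESUME: run before the task returns to user mode */
	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}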


[..snip..]

Thanks,
Aaron

--
Thanks and Regards,
Prateek