Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle

From: Aaron Lu
Date: Wed Mar 19 2025 - 09:45:27 EST


Hi Josh,

On Sat, Mar 15, 2025 at 08:25:53PM -0700, Josh Don wrote:
> Hi Aaron,
>
> >  static int tg_throttle_down(struct task_group *tg, void *data)
> >  {
> >  	struct rq *rq = data;
> >  	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> > +	struct task_struct *p;
> > +	struct rb_node *node;
> > +
> > +	cfs_rq->throttle_count++;
> > +	if (cfs_rq->throttle_count > 1)
> > +		return 0;
> >
> >  	/* group is entering throttled state, stop time */
> > -	if (!cfs_rq->throttle_count) {
> > -		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> > -		list_del_leaf_cfs_rq(cfs_rq);
> > +	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> > +	list_del_leaf_cfs_rq(cfs_rq);
> >
> > -		SCHED_WARN_ON(cfs_rq->throttled_clock_self);
> > -		if (cfs_rq->nr_queued)
> > -			cfs_rq->throttled_clock_self = rq_clock(rq);
> > +	SCHED_WARN_ON(cfs_rq->throttled_clock_self);
> > +	if (cfs_rq->nr_queued)
> > +		cfs_rq->throttled_clock_self = rq_clock(rq);
> > +
> > +	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
> > +	/*
> > +	 * rq_lock is held, current is (obviously) executing this in kernelspace.
> > +	 *
> > +	 * All other tasks enqueued on this rq have their saved PC at the
> > +	 * context switch, so they will go through the kernel before returning
> > +	 * to userspace. Thus, there are no tasks-in-userspace to handle, just
> > +	 * install the task_work on all of them.
> > +	 */
> > +	node = rb_first(&cfs_rq->tasks_timeline.rb_root);
> > +	while (node) {
> > +		struct sched_entity *se = __node_2_se(node);
> > +
> > +		if (!entity_is_task(se))
> > +			goto next;
> > +
> > +		p = task_of(se);
> > +		task_throttle_setup_work(p);
> > +next:
> > +		node = rb_next(node);
> > +	}
>
> I'd like to strongly push back on this approach. This adds quite a lot
> of extra computation to an already expensive path
> (throttle/unthrottle). e.g. this function is part of the cgroup walk
> and so it is already O(cgroups) for the number of cgroups in the
> hierarchy being throttled. This gets even worse when you consider that
> we repeat this separately across all the cpus that the
> bandwidth-constrained group is running on. Keep in mind that
> throttle/unthrottle is done with rq lock held and IRQ disabled.

Agreed, doing this O(nr_task) walk in the throttle/unthrottle path is
not good. As Chengming mentioned, the throttle path can avoid it, but
the unthrottle path has no easy way to do so.

> In K Prateek's last RFC, there was discussion of using context
> tracking; did you consider that approach any further? We could keep

I haven't tried that approach yet.

> track of the number of threads within a cgroup hierarchy currently in
> kernel mode (similar to h_nr_runnable), and thus simplify down the
> throttling code here.

My initial feeling is that the implementation of that approach looks
pretty complex. If it can be simplified somehow, that would be great.
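
Just to check that I read the counting idea correctly, here is a rough
userspace sketch of it. Everything in it is made up for illustration
(struct toy_cfs_rq, h_nr_kernel and the toy_* helpers are not existing
kernel names), and it assumes the count is propagated up the hierarchy
the same way h_nr_runnable is:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-CPU cfs_rq; not the kernel struct. */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;	/* NULL at the root of the hierarchy */
	unsigned int h_nr_kernel;	/* hypothetical: tasks in kernel mode at or below this level */
};

/* A task attached to @cfs_rq enters kernel mode: bump every ancestor. */
static void toy_kernel_enter(struct toy_cfs_rq *cfs_rq)
{
	for (; cfs_rq; cfs_rq = cfs_rq->parent)
		cfs_rq->h_nr_kernel++;
}

/* A task attached to @cfs_rq returns to userspace: drop every ancestor. */
static void toy_kernel_exit(struct toy_cfs_rq *cfs_rq)
{
	for (; cfs_rq; cfs_rq = cfs_rq->parent) {
		assert(cfs_rq->h_nr_kernel > 0);
		cfs_rq->h_nr_kernel--;
	}
}

/* At throttle time, one counter check replaces the walk over the tasks. */
static int toy_can_throttle_now(const struct toy_cfs_rq *cfs_rq)
{
	return cfs_rq->h_nr_kernel == 0;
}
```

In this sketch the throttle-time check is O(1), but every kernel
entry/exit pays an O(depth) walk up the hierarchy, which is part of
what makes me think the real thing would not be simple.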

Best regards,
Aaron