Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle

From: K Prateek Nayak
Date: Thu Mar 13 2025 - 14:16:08 EST


Hello Aaron,

P.S. I've fixed the wrapped lines and have been testing the series. So
far I haven't run into any issues on my machine. Will report back if
anything surfaces.

I have a few comments inline below.

On 3/13/2025 12:51 PM, Aaron Lu wrote:

[..snip..]

+static inline void task_throttle_setup_work(struct task_struct *p)
+{
+	/*
+	 * Kthreads and exiting tasks don't return to userspace, so adding the
+	 * work is pointless
+	 */
+	if ((p->flags & (PF_EXITING | PF_KTHREAD)))
+		return;
+
+	if (task_has_throttle_work(p))
+		return;
+
+	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);

Does it make sense to add the throttle work to a delayed task? It may be
dequeued soon, and by the time it is queued back, the throttle situation
might have changed, in which case the work runs unnecessarily. Could the
throttle work instead be added at the point of enqueue for delayed
tasks?
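
Something along these lines perhaps (completely untested, just to
illustrate the idea; assumes the delayed-dequeue state can be read off
p->se.sched_delayed):

	/* In task_throttle_setup_work(): defer for delayed tasks */
	if (p->se.sched_delayed)
		return;

and then in enqueue_task_fair(), once the delayed task is actually
requeued:

	if (flags & ENQUEUE_DELAYED) {
		requeue_delayed_entity(se);
		/* hierarchy still throttled? add the work now */
		if (throttled_hierarchy(cfs_rq_of(se)))
			task_throttle_setup_work(p);
		return;
	}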

+}
+
 static int tg_throttle_down(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+	struct task_struct *p;
+	struct rb_node *node;
+
+	cfs_rq->throttle_count++;
+	if (cfs_rq->throttle_count > 1)
+		return 0;
 
 	/* group is entering throttled state, stop time */
-	if (!cfs_rq->throttle_count) {
-		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
-		list_del_leaf_cfs_rq(cfs_rq);
+	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);

One concern here is that the PELT clock is seemingly frozen despite the
hierarchy still being runnable. I've not yet tracked down whether this
will cause any problems once the hierarchy is unthrottled and all the
throttled time is negated from the PELT clock, but are there any
concerns here?

Maybe this can instead be done at dequeue, when cfs_rq->nr_queued on a
throttled_hierarchy() reaches 0.
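
Roughly something like this in the dequeue path (untested, hand-wavy
sketch; the exact spot where nr_queued drops to 0 may differ):

	/* last task left a throttled hierarchy; stop PELT time now */
	if (throttled_hierarchy(cfs_rq) && !cfs_rq->nr_queued) {
		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq_of(cfs_rq));
		list_del_leaf_cfs_rq(cfs_rq);
	}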

+	list_del_leaf_cfs_rq(cfs_rq);
 
-		SCHED_WARN_ON(cfs_rq->throttled_clock_self);
-		if (cfs_rq->nr_queued)
-			cfs_rq->throttled_clock_self = rq_clock(rq);
+	SCHED_WARN_ON(cfs_rq->throttled_clock_self);
+	if (cfs_rq->nr_queued)
+		cfs_rq->throttled_clock_self = rq_clock(rq);
+
+	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
+	/*
+	 * rq_lock is held, current is (obviously) executing this in kernelspace.
+	 *
+	 * All other tasks enqueued on this rq have their saved PC at the
+	 * context switch, so they will go through the kernel before returning
+	 * to userspace. Thus, there are no tasks-in-userspace to handle, just
+	 * install the task_work on all of them.
+	 */
+	node = rb_first(&cfs_rq->tasks_timeline.rb_root);
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		if (!entity_is_task(se))
+			goto next;
+
+		p = task_of(se);
+		task_throttle_setup_work(p);
+next:
+		node = rb_next(node);
+	}
+
+	/* curr is not in the timeline tree */
+	if (cfs_rq->curr && entity_is_task(cfs_rq->curr)) {

I believe we can reach here from pick_next_task_fair() ->
check_cfs_rq_runtime() -> throttle_cfs_rq(), in which case cfs_rq->curr
will still be set, despite the task being blocked, since
put_prev_entity() has not been called yet.

I believe there should be a task_on_rq_queued() check here for the
current task.
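
i.e., something like the following (untested; a blocked current has
already been dequeued, so task_on_rq_queued() should be false for it
even though cfs_rq->curr still points at its se):

	/* curr is not in the timeline tree */
	if (cfs_rq->curr && entity_is_task(cfs_rq->curr)) {
		p = task_of(cfs_rq->curr);

		/* a blocked current is on its way out; skip it */
		if (task_on_rq_queued(p))
			task_throttle_setup_work(p);
	}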

+		p = task_of(cfs_rq->curr);
+		task_throttle_setup_work(p);
 	}
-	cfs_rq->throttle_count++;
 
 	return 0;
 }


[..snip..]

--
Thanks and Regards,
Prateek