[RFC PATCH v2 0/7] Defer throttle when task exits to user

From: Aaron Lu
Date: Wed Apr 09 2025 - 08:09:24 EST


This is a continuation of the work Valentin Schneider posted here:
Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@xxxxxxxxxx/

Valentin has described the problem very well in the above link. We also
see task hung problems from time to time in our environment due to cfs
quota. It is most visible with rwsem: when a reader holding the lock is
throttled, a writer comes in and has to wait, and that writer in turn
blocks all subsequent readers, causing priority inversion or even a
whole-system hang.

To improve this situation, this series changes the throttle model to be
task based, i.e. when a cfs_rq is throttled, mark its throttled status
but do not remove it from the cpu's rq. Instead, when tasks belonging to
this cfs_rq get picked, add a task work to them so that they dequeue
themselves on their return to user space. This way, throttled tasks do
not hold any kernel resources while throttled. When the cfs_rq gets
unthrottled, those throttled tasks are enqueued back.
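
To make the mechanism concrete, here is a minimal sketch in kernel-style
C. The field and helper names (sched_throttle_work, throttle_node,
throttled_limbo_list) follow the description above but are assumptions
for illustration; locking and the actual sched class plumbing are
omitted:

/*
 * Sketch only, names assumed; the real patches differ in detail.
 * At pick time, a task in a throttled hierarchy gets a task work
 * that fires on its next return to user space (TWA_RESUME).
 */
static void task_throttle_setup_work(struct task_struct *p)
{
	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}

/* Runs in the task's own context right before it returns to user. */
static void throttle_cfs_rq_work(struct callback_head *work)
{
	struct task_struct *p =
		container_of(work, struct task_struct, sched_throttle_work);
	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);

	/*
	 * Dequeue the task so it stops running while throttled, and
	 * park it on its cfs_rq's limbo list so unthrottle can find it
	 * again. rq lock handling omitted for brevity.
	 */
	dequeue_task(task_rq(p), p, DEQUEUE_SLEEP);
	list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
}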

This new throttle model has consequences. Take, for example, a cfs_rq
with 3 tasks attached: when 2 of them have been throttled on their
return-to-user path while the third is still running in kernel mode,
the cfs_rq is in a partially throttled state:
- Should its pelt clock be frozen?
- Should this state be accounted into throttled_time?

For the pelt clock, I chose to keep the current behavior and freeze it
at the cfs_rq's throttle time. The assumption is that tasks running in
kernel mode should not run for long, so freezing the cfs_rq's pelt clock
keeps its load and its corresponding sched_entity's weight stable.
Hopefully, this results in a stable situation that lets the remaining
running tasks quickly finish their jobs in kernel mode.

For throttle time accounting, I can see several possibilities:
- Similar to current behavior: start accounting when the cfs_rq gets
throttled (if cfs_rq->nr_queued > 0) and stop accounting when it gets
unthrottled. This has one drawback: if the cfs_rq has one task when it
gets throttled and that task eventually blocks instead of returning to
user, then the cfs_rq has no tasks on its throttled list but the time
is still accounted as throttled. Patch2 and patch3 implement this
accounting (simple, fewer code changes).
- Start accounting when the throttled cfs_rq has at least one task on
its throttled list; stop accounting when it is unthrottled. This
over-accounts throttled time because the partially throttled state is
included.
- Start accounting when the throttled cfs_rq has no tasks left and its
throttled list is not empty; stop accounting when the cfs_rq is
unthrottled. This under-accounts throttled time because the partially
throttled state is not included. Patch7 implements this accounting;
a sketch of the condition follows below.
I do not have a strong feeling about which accounting is best; it's
open for discussion.
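
As a sketch of the Patch7-style condition (field names like
throttled_limbo_list are assumptions, as above), a cfs_rq would only be
considered fully throttled for accounting once no task is left running
and everything sits on the limbo list:

/* Sketch only: account time as throttled only when nothing runs. */
static bool cfs_rq_fully_throttled(struct cfs_rq *cfs_rq)
{
	return cfs_rq->throttled && !cfs_rq->nr_queued &&
	       !list_empty(&cfs_rq->throttled_limbo_list);
}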

There was also a concern about the increased duration of (un)throttle
operations in v1. I've done some tests and, with a 2000 cgroups/20K
runnable tasks setup on a 2-socket/384-cpu AMD server, the longest
duration of distribute_cfs_runtime() is in the 2ms-4ms range. For
details, please see:
https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
For the throttle path, with Chengming's suggestion to move the "task
work setup" from throttle time to pick time, it's not an issue anymore.

Patches:
Patch1 is preparation work;

Patch2-3 provide the main functionality.
Patch2 deals with the throttle path: when a cfs_rq is to be throttled,
mark the throttled status for this cfs_rq; when a task in the throttled
hierarchy gets picked, add a task work to it so that when the task
returns to user space, the task work can throttle it by dequeuing it,
remembering this by adding the task to its cfs_rq's limbo list;
Patch3 deals with the unthrottle path: when a cfs_rq is to be
unthrottled, enqueue back the tasks on its limbo list, as sketched
below;
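
A rough sketch of that unthrottle step, with the same assumed names as
above (locking and the hierarchy walk are omitted):

/* Sketch only: put every parked task back on the rq on unthrottle. */
static void enqueue_throttled_tasks(struct cfs_rq *cfs_rq)
{
	struct task_struct *p, *tmp;

	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list,
				 throttle_node) {
		list_del_init(&p->throttle_node);
		enqueue_task(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
	}
}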

Patch4 deals with the dequeue path when a task changes group, sched
class etc. A throttled task is dequeued in the fair class, but its
task->on_rq is still set, so when it changes task group or sched class,
or has its affinity setting changed, the core will first dequeue it.
Since the task is already dequeued in the fair class, this patch handles
that situation, as sketched below.
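
A sketch of the idea (task_is_throttled() and throttle_node are assumed
names): the fair class detects that the task was already dequeued by the
throttle work and only removes it from the limbo list rather than
dequeuing it a second time:

/*
 * Sketch only. Called from the fair class dequeue path when the core
 * dequeues a task for a group/class/affinity change while p->on_rq
 * is still set.
 */
static bool dequeue_throttled_task(struct task_struct *p, int flags)
{
	if (!task_is_throttled(p))
		return false;	/* normal dequeue proceeds */

	/* Already off the rq: just drop it from the limbo list. */
	list_del_init(&p->throttle_node);
	return true;
}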

Patch5-6 are cleanups: some code became obsolete after switching to the
task based throttle mechanism.

Patch7 implements an alternative accounting mechanism for task based
throttle.

Changes since v1:
- Move "add task work" from throttle time to pick time, suggested by
Chengming Zhou;
- Use scoped_guard() and cond_resched_tasks_rcu_qs() in
throttle_cfs_rq_work(), suggested by K Prateek Nayak;
- Remove now obsolete throttled_lb_pair(), suggested by K Prateek Nayak;
- Fix cfs_rq->runtime_remaining condition check in unthrottle_cfs_rq(),
suggested by K Prateek Nayak;
- Fix h_nr_runnable accounting for delayed dequeue case when task based
throttle is in use;
- Implemented an alternative way of throttle time accounting for
discussion purpose;
- Make !CONFIG_CFS_BANDWIDTH build.
I hope I didn't omit any feedback I've received, but feel free to let me
know if I did.

As in v1, all change logs were written by me; if they read poorly, it's
my fault.

Comments are welcome.

Base commit: tip/sched/core, commit 6432e163ba1b ("sched/isolation: Make
use of more than one housekeeping cpu").

Aaron Lu (4):
sched/fair: Take care of group/affinity/sched_class change for
throttled task
sched/fair: get rid of throttled_lb_pair()
sched/fair: fix h_nr_runnable accounting with per-task throttle
sched/fair: alternative way of accounting throttle time

Valentin Schneider (3):
sched/fair: Add related data structure for task based throttle
sched/fair: Handle throttle path for task based throttle
sched/fair: Handle unthrottle path for task based throttle

include/linux/sched.h | 4 +
kernel/sched/core.c | 3 +
kernel/sched/fair.c | 449 ++++++++++++++++++++++--------------------
kernel/sched/sched.h | 7 +
4 files changed, 248 insertions(+), 215 deletions(-)

--
2.39.5