[RFC PATCH 0/7] Defer throttle when task exits to user
From: Aaron Lu
Date: Thu Mar 13 2025 - 03:22:20 EST
This is a continuation of Valentin Schneider's work posted here:
Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@xxxxxxxxxx/
Valentin has described the problem very well in the above link. We also
see task hung problems from time to time in our environment due to cfs
quota. It is most visible with rwsem: when a reader is throttled, a
writer comes in and has to wait, and that writer in turn makes all
subsequent readers wait, causing priority inversion or even a whole
system hang.
Changes I've made since Valentin's v3:
- Use enqueue_task_fair() and dequeue_task_fair() in cfs_rq's throttle
  and unthrottle paths;
- Get rid of the irq_work: the task work that is supposed to throttle
  the task can figure things out and act accordingly, so there is no
  need for an irq_work to cancel a no longer needed task work;
- Several fixes, like taking care of task group changes, sched class
  changes etc. for throttled tasks;
- tasks_rcu fix with this task based throttle.
Tests:
- A basic test to verify functionality like limiting cgroup cpu time
  and changing task group, affinity etc.
- A script that tries to mimic a large cgroup setup is used to see how
  bad it is to unthrottle cfs_rqs and enqueue back a large number of
  tasks in hrtimer context.
The test was done on a 2-socket/384-thread AMD CPU with the following
cgroup setup: 2 first level cgroups with quota settings, each with 100
child cgroups, and each child cgroup has 10 leaf child cgroups, for a
total of 2000 leaf cgroups. In each leaf cgroup, 10 cpu hog tasks are
created. Below are the durations of distribute_cfs_runtime() during a
1 minute window:
@durations:
[8K, 16K)        274 |@@@@@@@@@@@@@@@@@@@@@                               |
[16K, 32K)       132 |@@@@@@@@@@                                          |
[32K, 64K)         6 |                                                    |
[64K, 128K)        0 |                                                    |
[128K, 256K)       2 |                                                    |
[256K, 512K)       0 |                                                    |
[512K, 1M)       117 |@@@@@@@@@                                           |
[1M, 2M)         665 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2M, 4M)          10 |                                                    |
So the largest durations are in the 2-4ms range in this hrtimer
context. How bad is this number? I think it is acceptable, but maybe
the setup I created is not complex enough?
In older kernels where async unthrottle is not available, the largest
duration can be about 100ms or more.
Patches:
The patchset is arranged to get the basic functionality done first and
then deal with special cases. I hope this can make it easier to review.
Patch1 is preparation work;
Patch2-3 provide the main functionality.
Patch2 deals with the throttle path: when a cfs_rq is to be throttled,
add a task work to each of its tasks so that when a task returns to
user space, the task work can throttle it by dequeuing the task, and
remember this by adding the task to its cfs_rq's limbo list;
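For illustration, here is a minimal C sketch of what such a task work
handler could look like. The field and helper names (throttle_work,
throttle_node, throttled_limbo_list) are assumptions made for the
sketch, locking is elided, and this is not the actual patch code:

static void throttle_cfs_rq_work(struct callback_head *work)
{
	struct task_struct *p = container_of(work, struct task_struct,
					     throttle_work);
	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
	struct rq *rq = task_rq(p);

	/*
	 * The cfs_rq may have been unthrottled, or the task may have
	 * moved to an unthrottled cfs_rq, before this work runs; in
	 * that case there is nothing to do.
	 */
	if (!cfs_rq->throttle_count)
		return;

	/*
	 * Dequeue the task from the fair class and remember it on the
	 * limbo list so the unthrottle path can enqueue it back later.
	 */
	dequeue_task_fair(rq, p, DEQUEUE_SLEEP);
	list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
}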
Patch3 deals with the unthrottle path: when a cfs_rq is to be
unthrottled, enqueue back the tasks on its limbo list;
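Roughly, the unthrottle side then does something like the sketch below
(same assumed names as above, locking and runtime accounting elided):

static void unthrottle_cfs_rq_tasks(struct cfs_rq *cfs_rq)
{
	struct rq *rq = rq_of(cfs_rq);
	struct task_struct *p, *tmp;

	/* Enqueue back every task that the throttle work dequeued. */
	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list,
				 throttle_node) {
		list_del_init(&p->throttle_node);
		enqueue_task_fair(rq, p, ENQUEUE_WAKEUP);
	}
}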
Patch4-5 deal with special cases.
Patch4 deals with task migration: if a task migrates to a throttled
cfs_rq, set up the throttle task work for it. If instead a task that
already has the task work added migrates to a cfs_rq that is not
throttled, its task work remains: the work handler will figure things
out and skip the throttle. This also covers setting up the throttle
task work for tasks that switch to the fair class, change task group
etc., because all of these need to enqueue the task on the target
cfs_rq;
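As an illustration of the idea (again an assumed sketch, not the actual
patch; throttle_work and throttle_work_added are made-up fields), the
enqueue path could set up the work along these lines, with
task_work_add()/TWA_RESUME making the handler run on the next return
to user:

static void task_throttle_setup_work(struct task_struct *p,
				     struct cfs_rq *cfs_rq)
{
	/* Only tasks landing on a throttled cfs_rq need the work. */
	if (!cfs_rq->throttle_count)
		return;

	/* Avoid adding the same work twice. */
	if (p->throttle_work_added)
		return;

	init_task_work(&p->throttle_work, throttle_cfs_rq_work);
	if (!task_work_add(p, &p->throttle_work, TWA_RESUME))
		p->throttle_work_added = true;
}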
Patch5 deals with the dequeue path when a task changes group, sched
class etc. A task that is throttled has been dequeued in fair, but
task->on_rq is still set, so when it changes task group, sched class or
has its affinity changed, the core will first dequeue it. Since this
task is already dequeued in the fair class, this patch handles that
situation.
Patch6-7 are two fixes for problems found while testing. I can also
fold them into earlier patches if that is better.
Patch6 makes CONFIG_TASKS_RCU happy. Throttled tasks get scheduled in
task_work_run() by cond_resched(), but that is a preempt schedule and
doesn't mark a task RCU quiescent state, so I add a schedule() call in
the throttle task work directly.
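Conceptually the fix amounts to ending the throttle task work with a
real schedule, something like this simplified, assumed sketch of the
handler's tail:

static void throttle_cfs_rq_work(struct callback_head *work)
{
	/* ... dequeue the task and add it to the limbo list, as above ... */

	/*
	 * Go through a real schedule() instead of relying on the
	 * cond_resched() in task_work_run(): a direct schedule() is a
	 * voluntary context switch, which TASKS_RCU counts as a
	 * quiescent state for this task.
	 */
	schedule();
}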
Patch7 fixes a problem where the unthrottle path can cause throttle to
happen again when enqueuing tasks.
All the patch changelogs are written by me, so if the changelogs look
poor, it's my bad.
Comments are welcome. If you see any problems or issues with this
approach, please feel free to let me know, thanks.
Base commit: tip/sched/core, commit fd881d0a085f ("rseq: Fix segfault
on registration when rseq_cs is non-zero").
Known issues:
- !CONFIG_CFS_BANDWIDTH is not tested at all yet;
- task_is_throttled_fair() could probably be replaced with
  task_is_throttled() now, but I'll leave that for the next version.
- cfs_rq's pelt clock is stopped on throttle while it can still have
  tasks running (e.g. some task still running in kernel space).
  It's also possible to keep its pelt clock running till its last task
  is throttled/dequeued, but then this cfs_rq's load may be decreased
  too much since many of its tasks are throttled. For now, keep it
  simple by keeping the current behavior.
Aaron Lu (4):
sched/fair: Take care of migrated task for task based throttle
sched/fair: Take care of group/affinity/sched_class change for
throttled task
sched/fair: fix tasks_rcu with task based throttle
sched/fair: Make sure cfs_rq has enough runtime_remaining on
unthrottle path
Valentin Schneider (3):
sched/fair: Add related data structure for task based throttle
sched/fair: Handle throttle path for task based throttle
sched/fair: Handle unthrottle path for task based throttle
include/linux/sched.h | 4 +
kernel/sched/core.c | 3 +
kernel/sched/fair.c | 380 +++++++++++++++++++++++-------------------
kernel/sched/sched.h | 3 +
4 files changed, 216 insertions(+), 174 deletions(-)
--
2.39.5