Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model

From: Chen, Yu C
Date: Sun Aug 17 2025 - 04:51:19 EST


On 7/15/2025 3:16 PM, Aaron Lu wrote:
From: Valentin Schneider <vschneid@xxxxxxxxxx>

In the current throttle model, when a cfs_rq is throttled, its entity is
dequeued from the CPU's rq, so tasks attached to it are not able to run,
thus achieving the throttle target.

This has a drawback though: assume a task is a reader of a percpu_rwsem
and is waiting. When it gets woken, it cannot run until its task group's
next period comes, which can be a relatively long time. A waiting writer
has to wait longer because of this, and further readers build up behind
it, eventually triggering hung-task reports.

To improve this situation, change the throttle model to task based, i.e.
when a cfs_rq is throttled, record its throttled status but do not remove
it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
they get picked, add a task work to them so that when they return
to user, they can be dequeued there. In this way, tasks throttled will
not hold any kernel resources. And on unthrottle, enqueue back those
tasks so they can continue to run.
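
The lifecycle described above can be sketched as a small state model. This is only an illustration: struct toy_task and the toy_* helpers are hypothetical names, not the kernel's data structures or functions, and the model ignores locking and the cfs_rq hierarchy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy task state; field names are illustrative, not the kernel's. */
struct toy_task {
	bool queued;		/* still on the runqueue */
	bool work_pending;	/* throttle task work has been added */
};

/* Pick time: the task stays queued; it is only marked if its cfs_rq is throttled. */
static void toy_pick(struct toy_task *t, bool cfs_rq_throttled)
{
	assert(t != NULL);
	if (cfs_rq_throttled)
		t->work_pending = true;
}

/* Return to user space: the pending work dequeues the task here. */
static void toy_return_to_user(struct toy_task *t)
{
	assert(t != NULL);
	if (t->work_pending) {
		t->queued = false;
		t->work_pending = false;
	}
}

/* Unthrottle: enqueue the task back so it can continue to run. */
static void toy_unthrottle(struct toy_task *t)
{
	assert(t != NULL);
	t->queued = true;
}
```

In this model a task that is picked while its cfs_rq is throttled keeps running until toy_return_to_user(), which is the point where it no longer holds kernel resources.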

A throttled cfs_rq's PELT clock is handled differently now: previously the
cfs_rq's PELT clock was stopped once it entered the throttled state, but
since tasks (in kernel mode) can now continue to run, change the behaviour
to stop the PELT clock only when the throttled cfs_rq has no tasks left.

Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
Suggested-by: Chengming Zhou <chengming.zhou@xxxxxxxxx> # tag on pick
Signed-off-by: Valentin Schneider <vschneid@xxxxxxxxxx>
Signed-off-by: Aaron Lu <ziqianlu@xxxxxxxxxxxxx>
---

[snip]


@@ -8813,19 +8815,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 {
 	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
+	struct task_struct *p;
+	bool throttled;
 again:
 	cfs_rq = &rq->cfs;
 	if (!cfs_rq->nr_queued)
 		return NULL;
+	throttled = false;
+
 	do {
 		/* Might not have done put_prev_entity() */
 		if (cfs_rq->curr && cfs_rq->curr->on_rq)
 			update_curr(cfs_rq);
-		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
-			goto again;
+		throttled |= check_cfs_rq_runtime(cfs_rq);
 		se = pick_next_entity(rq, cfs_rq);
 		if (!se)
@@ -8833,7 +8838,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
-	return task_of(se);
+	p = task_of(se);
+	if (unlikely(throttled))
+		task_throttle_setup_work(p);
+	return p;
 }

Previously, I was wondering whether the above change might impact
wakeup latency in some corner cases: if many tasks are enqueued on a
throttled cfs_rq, the pick path might repeatedly return a p that gets
throttled right away (p is dequeued and a reschedule is triggered in
throttle_cfs_rq_work() to pick the next p, and then the new p is again
found on a throttled cfs_rq). Before the above change, the sched_entity
corresponding to the whole cfs_rq was dequeued in throttle_cfs_rq()
(se = cfs_rq->tg->se[cpu]), so its tasks could not be picked again.
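
To make the concern concrete, here is a toy model of the two behaviours. It is a sketch under simplifying assumptions: struct toy_cfs_rq and both helpers are hypothetical, and the worst case assumes every throttled task is picked before any runnable task outside the throttled group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one cfs_rq level; not the kernel structure. */
struct toy_cfs_rq {
	bool over_quota;		/* what check_cfs_rq_runtime() would report */
	struct toy_cfs_rq *child;	/* next level down, NULL at the leaf */
};

/* Walk root to leaf, OR-ing the throttle state like the patched loop does. */
static bool toy_pick_is_throttled(const struct toy_cfs_rq *root)
{
	bool throttled = false;
	const struct toy_cfs_rq *rq;

	assert(root != NULL);
	for (rq = root; rq; rq = rq->child)
		throttled |= rq->over_quota;
	return throttled;
}

/*
 * Worst-case number of picks before a task outside the throttled group
 * is returned (assumption: the throttled tasks always win the pick).
 */
static unsigned int toy_worst_case_picks(unsigned int nr_throttled_tasks,
					 bool task_based)
{
	/* Old model: the group entity is dequeued once, the retry succeeds. */
	if (!task_based)
		return 1;
	/* Task based: each throttled task may be picked and dequeued in turn. */
	return nr_throttled_tasks + 1;
}
```

With 10 nested levels the walk itself stays short; what grows with the number of throttled tasks is the count of pick, dequeue and reschedule rounds, which is the scenario the measurements below are meant to check.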

So I did some tests for this scenario on a Xeon with 6 NUMA nodes and
384 CPUs. I created 10 levels of cgroups and ran schbench on the leaf
cgroup. The results show that there is not much impact in terms of
wakeup latency (considering the standard deviation). Based on the data
and my understanding, for this series,

Tested-by: Chen Yu <yu.c.chen@xxxxxxxxx>


The test script parameters are borrowed from the ones attached previously:
#!/bin/bash

# Usage: pass the number of nested cgroup levels as the only argument
if [ $# -ne 1 ]; then
	echo "please provide cgroup level"
	exit 1
fi

N=$1
current_path="/sys/fs/cgroup"

# Create N nested cgroups (1/2/.../N); enable controllers on all but the leaf
for ((i=1; i<=N; i++)); do
	new_dir="${current_path}/${i}"
	mkdir -p "$new_dir"
	if [ "$i" -ne "$N" ]; then
		echo '+cpu +memory +pids' > "${new_dir}/cgroup.subtree_control"
	fi
	current_path="$new_dir"
done

echo "current_path:${current_path}"
# Leaf limits: 1600000us of quota per 100000us period, and 34G of memory
echo "1600000 100000" > "${current_path}/cpu.max"
echo "34G" > "${current_path}/memory.max"

# Move this shell into the leaf cgroup so the benchmark inherits it
echo $$ > "${current_path}/cgroup.procs"
#./run-mmtests.sh --no-monitor --config config-schbench baseline
./run-mmtests.sh --no-monitor --config config-schbench sch_throt


# Move remaining processes back to the root cgroup, then remove leaf first
pids=$(cat "${current_path}/cgroup.procs")
for pid in $pids; do
	echo "$pid" > "/sys/fs/cgroup/cgroup.procs" 2>/dev/null
done
for ((i=N; i>=1; i--)); do
	rmdir "${current_path}"
	current_path=$(dirname "$current_path")
done
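
For reference, the cpu.max line written by the script uses the cgroup v2 "quota period" format, in microseconds, with "max" meaning no limit. A minimal sketch of the conversion into CPUs' worth of bandwidth follows; cpu_max_to_cpus() and its negative return codes are illustrative choices, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Convert a cpu.max line ("<quota> <period>", both in microseconds) into
 * the number of CPUs' worth of bandwidth it allows per period.
 * Returns -1.0 for "max" (unlimited) and -2.0 on a parse error.
 */
static double cpu_max_to_cpus(const char *line)
{
	unsigned long long quota, period;

	assert(line != NULL);
	if (strncmp(line, "max", 3) == 0)
		return -1.0;
	if (sscanf(line, "%llu %llu", &quota, &period) != 2 || period == 0)
		return -2.0;
	return (double)quota / (double)period;
}
```

For the "1600000 100000" setting above this gives 16, i.e. the leaf cgroup may consume up to 16 CPUs' worth of runtime in every 100ms period before it is throttled.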


Results (Change is signed so that positive means better: lower latency
or higher RPS):

schbench thread = 1
Metric                     Base mean(std)        Compare mean(std)     Change
------------------------------------------------------------------------------
// the baseline's std is about 35% of its mean, so the change below should not be a problem
Wakeup Latencies 99.0th    15.00(5.29)           17.00(1.00)           -13.33%
Request Latencies 99.0th   3830.67(33.31)        3854.67(25.72)        -0.63%
RPS 50.0th                 1598.00(4.00)         1606.00(4.00)         +0.50%
Average RPS                1597.77(5.13)         1606.11(4.75)         +0.52%

schbench thread = 2
Metric                     Base mean(std)        Compare mean(std)     Change
------------------------------------------------------------------------------
Wakeup Latencies 99.0th    18.33(0.58)           18.67(0.58)           -1.85%
Request Latencies 99.0th   3868.00(49.96)        3854.67(44.06)        +0.34%
RPS 50.0th                 3185.33(4.62)         3204.00(8.00)         +0.59%
Average RPS                3186.49(2.70)         3204.21(11.25)        +0.56%

schbench thread = 4
Metric                     Base mean(std)        Compare mean(std)     Change
------------------------------------------------------------------------------
Wakeup Latencies 99.0th    19.33(1.15)           19.33(0.58)           0.00%
Request Latencies 99.0th   35690.67(517.31)      35946.67(517.31)      -0.72%
RPS 50.0th                 4418.67(18.48)        4434.67(9.24)         +0.36%
Average RPS                4414.38(16.94)        4436.02(8.77)         +0.49%

schbench thread = 8
Metric                     Base mean(std)        Compare mean(std)     Change
------------------------------------------------------------------------------
Wakeup Latencies 99.0th    22.67(0.58)           22.33(0.58)           +1.50%
Request Latencies 99.0th   73002.67(147.80)      72661.33(147.80)      +0.47%
RPS 50.0th                 4376.00(16.00)        4392.00(0.00)         +0.37%
Average RPS                4373.89(15.04)        4393.88(6.22)         +0.46%

schbench thread = 16
Metric                     Base mean(std)        Compare mean(std)     Change
------------------------------------------------------------------------------
Wakeup Latencies 99.0th    29.00(2.65)           29.00(3.61)           0.00%
Request Latencies 99.0th   88704.00(0.00)        88704.00(0.00)        0.00%
RPS 50.0th                 4274.67(24.44)        4290.67(9.24)         +0.37%
Average RPS                4277.27(24.80)        4287.97(9.80)         +0.25%

schbench thread = 32
Metric                     Base mean(std)        Compare mean(std)     Change
------------------------------------------------------------------------------
Wakeup Latencies 99.0th    100.00(22.61)         82.00(16.46)          +18.00%
Request Latencies 99.0th   100138.67(295.60)     100053.33(147.80)     +0.09%
RPS 50.0th                 3942.67(20.13)        3916.00(42.33)        -0.68%
Average RPS                3919.39(19.01)        3892.39(42.26)        -0.69%

schbench thread = 63
Metric                     Base mean(std)        Compare mean(std)     Change
------------------------------------------------------------------------------
Wakeup Latencies 99.0th    94848.00(0.00)        94336.00(0.00)        +0.54%
// the baseline's std is about 19% of its mean, so the change below should not be a problem
Request Latencies 99.0th   264618.67(51582.78)   298154.67(591.21)     -12.67%
RPS 50.0th                 2641.33(4.62)         2628.00(8.00)         -0.50%
Average RPS                2659.49(8.88)         2636.17(7.58)         -0.88%
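
The Change and std% figures can be reproduced from the table values with the arithmetic below. This is a sketch of how the numbers appear to be derived, not the actual mmtests reporting code; change_pct() and std_pct() are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Relative change of compare vs. base in percent, signed so that a
 * positive value means "better": lower for latencies, higher for RPS.
 */
static double change_pct(double base, double compare, bool lower_is_better)
{
	double delta;

	assert(base != 0.0);
	delta = (compare - base) / base * 100.0;
	return lower_is_better ? -delta : delta;
}

/* Standard deviation as a percentage of the mean. */
static double std_pct(double mean, double std)
{
	assert(mean != 0.0);
	return std / mean * 100.0;
}
```

For example, the thread = 1 wakeup latency row gives change_pct(15.00, 17.00, true) of about -13.33 and std_pct(15.00, 5.29) of about 35, which matches the values quoted in the table.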

thanks,
Chenyu