Re: [RFC PATCH 0/7] Defer throttle when task exits to user
From: Aaron Lu
Date: Thu Mar 13 2025 - 04:31:25 EST
It appears this mail's message-id was changed, so it became a separate
thread. I'll check what went wrong; sorry about this.
On Thu, Mar 13, 2025 at 02:20:59AM -0500, Aaron Lu wrote:
> Tests:
> - A basic test to verify functionality such as limiting cgroup cpu time
> and changing task group, affinity, etc.
Here is the basic test script:
pid=$$
CG_PATH1=/sys/fs/cgroup/1
CG_PATH2=/sys/fs/cgroup/2
[ -d $CG_PATH1 ] && sudo rmdir $CG_PATH1
[ -d $CG_PATH2 ] && sudo rmdir $CG_PATH2
sudo mkdir -p $CG_PATH1
sudo mkdir -p $CG_PATH2
sudo sh -c "echo $pid > $CG_PATH1/cgroup.procs"
echo "start nop"
~/src/misc/nop &
nop_pid=$!
cat /proc/$nop_pid/cgroup
pidstat -p $nop_pid 1 &
sleep 5
echo "limit $CG_PATH1 to 1/10"
sudo sh -c "echo 10000 100000 > $CG_PATH1/cpu.max"
sleep 5
echo "limit $CG_PATH1 to 5/10"
sudo sh -c "echo 50000 100000 > $CG_PATH1/cpu.max"
sleep 5
echo "move to $CG_PATH2"
sudo sh -c "echo $nop_pid > $CG_PATH2/cgroup.procs"
sleep 5
echo "limit $CG_PATH2 to 5/10"
sudo sh -c "echo 50000 100000 > $CG_PATH2/cpu.max"
sleep 5
echo "limit $CG_PATH2 to 1/10"
sudo sh -c "echo 10000 100000 > $CG_PATH2/cpu.max"
sleep 5
echo "set affinity to cpu3"
taskset -p 0x8 $nop_pid
sleep 5
echo "set affinity to cpu10"
taskset -p 0x400 $nop_pid
sleep 5
echo "unlimit $CG_PATH2"
sudo sh -c "echo max 100000 > $CG_PATH2/cpu.max"
sleep 5
echo "move to $CG_PATH1"
sudo sh -c "echo $nop_pid > $CG_PATH1/cgroup.procs"
sleep 5
echo "change to rr with priority 10"
sudo chrt -r -p 10 $nop_pid
sleep 5
echo "change to fifo with priority 10"
sudo chrt -f -p 10 $nop_pid
sleep 5
echo "change back to fair"
sudo chrt -o -p 0 $nop_pid
sleep 5
echo "unlimit $CG_PATH1"
sudo sh -c "echo max 100000 > $CG_PATH1/cpu.max"
sleep 5
kill $nop_pid
note: nop is a cpu hog that does: while (1) spin();
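As a side note on the two taskset calls above: the mask for a single CPU is simply 1 shifted left by the CPU number. A minimal sketch of that arithmetic, where cpu_mask is a hypothetical helper that is not part of the test script:

```shell
# Hypothetical helper (not in the original script): print the hex
# affinity mask that selects exactly one CPU, i.e. 1 << cpu.
cpu_mask() {
    printf '0x%x\n' $((1 << $1))
}

cpu_mask 3     # prints 0x8, the mask used for cpu3
cpu_mask 10    # prints 0x400, the mask used for cpu10
```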
> - A script that tries to mimic a large cgroup setup is used to see how
> bad it is to unthrottle cfs_rqs and enqueue a large number of tasks
> back in hrtimer context.
Here are the test scripts:
CG_ROOT=/sys/fs/cgroup
nr_level1=2
nr_level2=100
nr_level3=10
for i in `seq $nr_level1`; do
    CG_LEVEL1=$CG_ROOT/$i
    echo "cg_level1: $CG_LEVEL1"
    [ -d $CG_LEVEL1 ] || sudo mkdir -p $CG_LEVEL1
    sudo sh -c "echo +cpu > $CG_LEVEL1/cgroup.subtree_control"
    for j in `seq $nr_level2`; do
        CG_LEVEL2=$CG_LEVEL1/${i}_$j
        echo "cg_level2: $CG_LEVEL2"
        [ -d $CG_LEVEL2 ] || sudo mkdir -p $CG_LEVEL2
        sudo sh -c "echo +cpu > $CG_LEVEL2/cgroup.subtree_control"
        for k in `seq $nr_level3`; do
            CG_LEVEL3=$CG_LEVEL2/${i}_${j}_$k
            [ -d $CG_LEVEL3 ] || sudo mkdir -p $CG_LEVEL3
            ~/test/run_in_cg.sh $CG_LEVEL3
        done
    done
done
function set_quota()
{
    quota=$1
    for i in `seq $nr_level1`; do
        CG_LEVEL1=$CG_ROOT/$i
        sudo sh -c "echo $quota 100000 > $CG_LEVEL1/cpu.max"
        echo "$CG_LEVEL1: `cat $CG_LEVEL1/cpu.max`"
    done
}
while true; do
    echo "sleep 20"
    sleep 20
    echo "set 20cpu quota to first level cgroups"
    set_quota 2000000
    echo "sleep 20"
    sleep 20
    echo "set 10cpu quota to first level cgroups"
    set_quota 1000000
    echo "sleep 20"
    sleep 20
    echo "set 5cpu quota to first level cgroups"
    set_quota 500000
    echo "sleep 20"
    sleep 20
    echo "unlimit first level cgroups"
    set_quota max
done
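For reference, the quota written to cpu.max is in microseconds per period (100000us in these scripts), so quota / period gives the number of CPUs' worth of bandwidth. A small sketch of the conversion, expressed as the %CPU ceiling a tool like pidstat would report; quota_to_pct is a hypothetical helper and is not part of the test scripts:

```shell
# Hypothetical helper: convert a cpu.max quota (us) and period (us)
# into a %CPU ceiling, where 100% means one full CPU.
quota_to_pct() {
    quota=$1
    period=${2:-100000}    # assumed default: the period used in the scripts
    if [ "$quota" = "max" ]; then
        echo "unlimited"
    else
        echo "$((quota * 100 / period))%"
    fi
}

for q in 2000000 1000000 500000 max; do
    echo "cpu.max quota $q -> $(quota_to_pct "$q")"
done
```

With the values used above this prints 2000%, 1000%, 500% and unlimited, i.e. 20, 10 and 5 CPUs; the 10000 and 50000 quotas from the basic test map to 10% and 50% of one CPU.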
run_in_cg.sh:
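set -e
CG_PATH=$1
[ -z "$CG_PATH" ] && {
    echo "need cgroup path"
    # exit non-zero so the caller can tell the argument was missing
    exit 1
}
echo "CG_PATH: $CG_PATH"
# move this shell into the cgroup so the hogs started below inherit it
sudo sh -c "echo $$ > $CG_PATH/cgroup.procs"
for i in `seq 10`; do
    ~/src/misc/nop &
done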
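To make the scale of this setup explicit, the counts implied by the loops above can be worked out as follows. This is only an arithmetic sketch; nr_tasks_per_leaf is a name introduced here for the "seq 10" loop in run_in_cg.sh:

```shell
nr_level1=2
nr_level2=100
nr_level3=10
nr_tasks_per_leaf=10    # illustrative name for the "seq 10" loop

leaf_cgroups=$((nr_level1 * nr_level2 * nr_level3))
all_cgroups=$((nr_level1 + nr_level1 * nr_level2 + leaf_cgroups))
hog_tasks=$((leaf_cgroups * nr_tasks_per_leaf))

echo "leaf cgroups:  $leaf_cgroups"    # 2000
echo "all cgroups:   $all_cgroups"     # 2202
echo "cpu hog tasks: $hog_tasks"       # 20000
```

So each of the two first level cgroups holds 10000 hog tasks that may need to be enqueued back on unthrottle.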
> The test was done on a 2sockets/384threads AMD CPU with the following
> cgroup setup: 2 first level cgroups with quota setting, each has 100
> child cgroups and each child cgroup has 10 leaf child cgroups, for a
> total of 2000 leaf cgroups. In each leaf cgroup, 10 cpu hog tasks are
> created. Below are the durations (in ns) of distribute_cfs_runtime()
> during a 1 minute window:
@durations:
[8K, 16K)             274 |@@@@@@@@@@@@@@@@@@@@@                               |
[16K, 32K)            132 |@@@@@@@@@@                                          |
[32K, 64K)              6 |                                                    |
[64K, 128K)             0 |                                                    |
[128K, 256K)            2 |                                                    |
[256K, 512K)            0 |                                                    |
[512K, 1M)            117 |@@@@@@@@@                                           |
[1M, 2M)              665 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2M, 4M)               10 |                                                    |
The bpftrace script used to capture this:
kfunc:distribute_cfs_runtime
{
    @start[args->cfs_b] = nsecs;
}
kretfunc:distribute_cfs_runtime
{
    if (@start[args->cfs_b]) {
        $duration = nsecs - @start[args->cfs_b];
        @durations = hist($duration);
        delete(@start[args->cfs_b]);
    }
}
interval:s:60
{
    exit();
}
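As a rough way to read the histogram, the bucket counts can be summed to see what share of calls landed in the slow buckets. The sketch below just copies the counts from the @durations output above; the 512K ns cut-off is an arbitrary choice for illustration:

```shell
# Bucket counts copied from the @durations histogram, lowest bucket first.
counts="274 132 6 0 2 0 117 665 10"

total=0
slow=0
i=0
for c in $counts; do
    total=$((total + c))
    # buckets 6..8 are [512K, 1M), [1M, 2M) and [2M, 4M)
    if [ "$i" -ge 6 ]; then
        slow=$((slow + c))
    fi
    i=$((i + 1))
done

echo "$slow of $total calls took >= 512K ns ($((slow * 100 / total))%)"
```

This prints "792 of 1206 calls took >= 512K ns (65%)", so roughly two thirds of the calls in that minute took about half a millisecond or more.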
> So the biggest duration is in the 2-4ms range in this hrtimer context.
> How bad is this number? I think it is acceptable, but maybe the setup
> I created is not complex enough?
> In older kernels where async unthrottle is not available, the largest
> duration can be about 100ms+.