Re: [RFC PATCH 0/7] Defer throttle when task exits to user

From: Aaron Lu
Date: Thu Mar 13 2025 - 04:31:25 EST


It appears this mail's Message-ID was changed, so it ended up as a
separate thread. I'll check what went wrong; sorry about this.

On Thu, Mar 13, 2025 at 02:20:59AM -0500, Aaron Lu wrote:
> Tests:
> - A basic test to verify functionality such as limiting cgroup cpu time
> and changing task group, affinity, etc.

Here is the basic test script:

pid=$$
CG_PATH1=/sys/fs/cgroup/1
CG_PATH2=/sys/fs/cgroup/2

[ -d $CG_PATH1 ] && sudo rmdir $CG_PATH1
[ -d $CG_PATH2 ] && sudo rmdir $CG_PATH2

sudo mkdir -p $CG_PATH1
sudo mkdir -p $CG_PATH2

sudo sh -c "echo $pid > $CG_PATH1/cgroup.procs"

echo "start nop"
~/src/misc/nop &
nop_pid=$!
cat /proc/$nop_pid/cgroup
pidstat -p $nop_pid 1 &
sleep 5

echo "limit $CG_PATH1 to 1/10"
sudo sh -c "echo 10000 100000 > $CG_PATH1/cpu.max"
sleep 5

echo "limit $CG_PATH1 to 5/10"
sudo sh -c "echo 50000 100000 > $CG_PATH1/cpu.max"
sleep 5

echo "move to $CG_PATH2"
sudo sh -c "echo $nop_pid > $CG_PATH2/cgroup.procs"
sleep 5

echo "limit $CG_PATH2 to 5/10"
sudo sh -c "echo 50000 100000 > $CG_PATH2/cpu.max"
sleep 5

echo "limit $CG_PATH2 to 1/10"
sudo sh -c "echo 10000 100000 > $CG_PATH2/cpu.max"
sleep 5

echo "set affinity to cpu3"
taskset -p 0x8 $nop_pid
sleep 5

echo "set affinity to cpu10"
taskset -p 0x400 $nop_pid
sleep 5

echo "unlimit $CG_PATH2"
sudo sh -c "echo max 100000 > $CG_PATH2/cpu.max"
sleep 5

echo "move to $CG_PATH1"
sudo sh -c "echo $nop_pid > $CG_PATH1/cgroup.procs"
sleep 5

echo "change to rr with priority 10"
sudo chrt -r -p 10 $nop_pid
sleep 5

echo "change to fifo with priority 10"
sudo chrt -f -p 10 $nop_pid
sleep 5

echo "change back to fair"
sudo chrt -o -p 0 $nop_pid
sleep 5

echo "unlimit $CG_PATH1"
sudo sh -c "echo max 100000 > $CG_PATH1/cpu.max"
sleep 5

kill $nop_pid

note: nop is a cpu hog that does: while (1) spin();
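
The taskset masks used in the script are simply 1 << cpu. A small helper
along these lines (cpu_mask is a name made up here, it is not part of the
script above) can derive them. It is only a sketch and relies on 64-bit
shell arithmetic, so it is limited to CPU numbers below 63; for higher
CPUs, taskset's -c cpu-list form avoids hex masks altogether.

```shell
# cpu_mask CPU: print the hex affinity mask that selects only that CPU.
# Hypothetical helper for illustration; valid for CPU numbers below 63.
cpu_mask() {
    printf '0x%x\n' $((1 << $1))
}

cpu_mask 3     # prints 0x8, the mask used for cpu3
cpu_mask 10    # prints 0x400, the mask used for cpu10
```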

> - A script that tries to mimic a large cgroup setup is used to see how
> bad it is to unthrottle cfs_rqs and enqueue back a large number of tasks
> in hrtimer context.

Here are the test scripts:

CG_ROOT=/sys/fs/cgroup

nr_level1=2
nr_level2=100
nr_level3=10

for i in `seq $nr_level1`; do
    CG_LEVEL1=$CG_ROOT/$i
    echo "cg_level1: $CG_LEVEL1"
    [ -d $CG_LEVEL1 ] || sudo mkdir -p $CG_LEVEL1
    sudo sh -c "echo +cpu > $CG_LEVEL1/cgroup.subtree_control"

    for j in `seq $nr_level2`; do
        CG_LEVEL2=$CG_LEVEL1/${i}_$j
        echo "cg_level2: $CG_LEVEL2"
        [ -d $CG_LEVEL2 ] || sudo mkdir -p $CG_LEVEL2
        sudo sh -c "echo +cpu > $CG_LEVEL2/cgroup.subtree_control"

        for k in `seq $nr_level3`; do
            CG_LEVEL3=$CG_LEVEL2/${i}_${j}_$k
            [ -d $CG_LEVEL3 ] || sudo mkdir -p $CG_LEVEL3
            ~/test/run_in_cg.sh $CG_LEVEL3
        done
    done
done

function set_quota()
{
    quota=$1

    for i in `seq $nr_level1`; do
        CG_LEVEL1=$CG_ROOT/$i
        sudo sh -c "echo $quota 100000 > $CG_LEVEL1/cpu.max"
        echo "$CG_LEVEL1: `cat $CG_LEVEL1/cpu.max`"
    done
}

while true; do
    echo "sleep 20"
    sleep 20

    echo "set 20cpu quota to first level cgroups"
    set_quota 2000000
    echo "sleep 20"
    sleep 20

    echo "set 10cpu quota to first level cgroups"
    set_quota 1000000
    echo "sleep 20"
    sleep 20

    echo "set 5cpu quota to first level cgroups"
    set_quota 500000
    echo "sleep 20"
    sleep 20

    echo "unlimit first level cgroups"
    set_quota max
done
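
For reference, the quota values passed to set_quota are just the number
of CPUs times the 100000us period written to cpu.max. A sketch of that
conversion follows; cpus_to_quota and PERIOD_US are names made up for
illustration and are not used by the script above.

```shell
# Period written to cpu.max by the script above, in microseconds.
PERIOD_US=100000

# cpus_to_quota NCPUS [PERIOD_US]: print the cpu.max quota (in us) that
# allows NCPUS worth of CPU time per period. Hypothetical helper.
cpus_to_quota() {
    echo $(( $1 * ${2:-$PERIOD_US} ))
}

cpus_to_quota 20    # prints 2000000
cpus_to_quota 10    # prints 1000000
cpus_to_quota 5     # prints 500000
```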

run_in_cg.sh:

set -e

CG_PATH=$1
[ -z "$CG_PATH" ] && {
    echo "need cgroup path" >&2
    exit 1
}

echo "CG_PATH: $CG_PATH"

sudo sh -c "echo $$ > $CG_PATH/cgroup.procs"

for i in `seq 10`; do
    ~/src/misc/nop &
done
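
As a sanity check on the size of this setup, the counts follow directly
from the loop bounds in the two scripts. The arithmetic below is only a
sketch; nr_hogs is a name introduced here for the "seq 10" loop in
run_in_cg.sh.

```shell
# Loop bounds taken from the scripts above.
nr_level1=2
nr_level2=100
nr_level3=10
nr_hogs=10    # hypothetical name for the "seq 10" loop in run_in_cg.sh

leaf=$((nr_level1 * nr_level2 * nr_level3))           # leaf cgroups
all=$((nr_level1 + nr_level1 * nr_level2 + leaf))     # cgroups at all levels
tasks=$((leaf * nr_hogs))                             # cpu hog tasks

echo "leaf cgroups: $leaf, all cgroups: $all, cpu hogs: $tasks"
# prints: leaf cgroups: 2000, all cgroups: 2202, cpu hogs: 20000
```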

> The test was done on a 2-socket/384-thread AMD CPU with the following
> cgroup setup: 2 first level cgroups with quota settings, each with 100
> child cgroups, and each child cgroup has 10 leaf child cgroups, for a
> total of 2000 leaf cgroups. In each leaf cgroup, 10 cpu hog tasks are
> created. Below are the durations (in ns) of distribute_cfs_runtime()
> during a 1 minute window:

@durations:
[8K, 16K)     274 |@@@@@@@@@@@@@@@@@@@@@                               |
[16K, 32K)    132 |@@@@@@@@@@                                          |
[32K, 64K)      6 |                                                    |
[64K, 128K)     0 |                                                    |
[128K, 256K)    2 |                                                    |
[256K, 512K)    0 |                                                    |
[512K, 1M)    117 |@@@@@@@@@                                           |
[1M, 2M)      665 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2M, 4M)       10 |                                                    |

The bpftrace script used to capture this:

kfunc:distribute_cfs_runtime
{
    @start[args->cfs_b] = nsecs;
}

kretfunc:distribute_cfs_runtime
{
    if (@start[args->cfs_b]) {
        $duration = nsecs - @start[args->cfs_b];
        @durations = hist($duration);
        delete(@start[args->cfs_b]);
    }
}

interval:s:60
{
    exit();
}
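
To put rough numbers on such a histogram, something like the following
could be used to sum the bucket counts. This was not part of the original
test; hist_summary is a made-up helper, and it assumes bpftrace's usual
hist() line layout ("[low, high) count |bar|") with K/M/G being powers of
two.

```shell
# hist_summary THRESH_NS: read bpftrace hist() lines on stdin and print
# the total sample count plus the count in buckets whose lower bound is
# at or above THRESH_NS. Hypothetical helper for illustration only.
hist_summary() {
    awk -v thresh="$1" '
    # Convert a bucket lower bound such as "[512K," to a number.
    function to_ns(s,    n, u) {
        sub(/^\[/, "", s)
        sub(/,$/, "", s)
        u = substr(s, length(s))
        n = s + 0
        if (u == "K") n *= 1024
        else if (u == "M") n *= 1048576
        else if (u == "G") n *= 1073741824
        return n
    }
    # Only range lines like "[8K, 16K) 274 |@@@ |" are counted.
    /^\[[0-9]+[KMG]?, / {
        total += $3
        if (to_ns($1) >= thresh) slow += $3
    }
    END { printf "total=%d at_or_above=%d\n", total, slow }
    '
}
```

Feeding it the @durations lines with a threshold of 524288 (512K) gives
total=1206 at_or_above=792, i.e. roughly two thirds of the calls took
about half a millisecond or longer.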

> So the longest duration is in the 2-4ms range in this hrtimer context.
> How bad is this number? I think it is acceptable, but maybe the setup
> I created is not complex enough?
> In older kernels where async unthrottle is not available, the longest
> durations can be about 100ms+.