Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned
From: Paul Turner
Date: Tue Jun 07 2011 - 23:09:48 EST
[ Sorry for the delayed response, I was out on vacation for the second
half of May until last week -- I've now caught up on email and am
preparing the next posting ]
Thanks for the test case, Kamalesh -- my immediate suspicion is that quota
return may not be fine-grained enough (although the numbers provided
are large enough that it's possible there's also just a bug).
I have some tools from my own testing I can use to pull this apart;
let me run your workload and get back to you.
On Tue, Jun 7, 2011 at 8:45 AM, Kamalesh Babulal
<kamalesh@xxxxxxxxxxxxxxxxxx> wrote:
> Hi All,
>
> In our test environment, while testing the CFS Bandwidth V6 patch set
> on top of 55922c9d1b84, we observed that the CPUs' idle time is between
> 30% and 40% while running a CPU-bound test with the cgroup tasks not
> pinned to the CPUs. In the inverse case, where the cgroup tasks are
> pinned to the CPUs, the idle time seen is nearly zero.
>
> Test Scenario
> --------------
> - 5 cgroups are created, with each group assigned 2, 2, 4, 8, and 16 tasks respectively.
> - Each cgroup has N sub-cgroups created under it, where N is the NR_TASKS the cgroup
> is assigned. i.e., cgroup1 will create two sub-cgroups under it and assign
> one task per sub-group.
> ------------
> | cgroup 1 |
> ------------
> / \
> / \
> -------------- --------------
> |sub-cgroup 1| |sub-cgroup 2|
> | (task 1) | | (task 2) |
> -------------- --------------
>
> - The top-level cgroups are given unlimited quota (cpu.cfs_quota_us = -1) and a period of 500ms
> (cpu.cfs_period_us = 500000), whereas the sub-cgroups are given 250ms of quota
> (cpu.cfs_quota_us = 250000) and a period of 500ms. i.e. the top-level cgroups have
> unlimited bandwidth, whereas each sub-cgroup is throttled after 250ms of runtime per 500ms period.
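>
> (For illustration, a minimal sketch of the equivalent configuration for
> Group 1 and its first sub-group, assuming the hierarchy is mounted at
> /cgroup as in the attached script:)
>
> echo -1     > /cgroup/1/cpu.cfs_quota_us      # unlimited quota for the top-level group
> echo 500000 > /cgroup/1/cpu.cfs_period_us     # 500ms period
> echo 500000 > /cgroup/1/1/cpu.cfs_period_us   # 500ms period for the sub-group
> echo 250000 > /cgroup/1/1/cpu.cfs_quota_us    # 250ms quota for the sub-group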
>
> - Additionally, if required, proportional CPU shares can be assigned via cpu.shares
> as NR_TASKS * 1024, i.e. cgroup1 with 2 tasks gets 2 * 1024 = 2048 worth of cpu.shares.
> (In the test results published below, all cgroups and sub-cgroups
> are given an equal share of 1024.)
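>
> (i.e., with proportional shares enabled, Group 1 would be configured roughly as:)
>
> echo $((2 * 1024)) > /cgroup/1/cpu.shares   # NR_TASKS * 1024 for a 2-task group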
>
> - One CPU-bound while(1) task is attached to each sub-cgroup.
>
> - The sum-exec time for each cgroup/sub-cgroup is captured from /proc/sched_debug after
> 60 seconds and analyzed for the run time of the tasks, i.e. per sub-cgroup.
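>
> (A minimal sketch of that extraction, mirroring what the attached script's
> capture_results()/print_results() do; it sums the runtime field of the
> while1 task entries in /proc/sched_debug for a given group path:)
>
> cat /proc/sched_debug > sched_log
> grep -i while1 sched_log | grep " /1" | sed 's/R//g' | awk '{sum += $7} END {printf "%f\n", sum}'   # Group 1 total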
>
> How is the idle CPU time measured?
> ------------------------------------
> - vmstat stats are logged every 2 seconds, from the moment the last while(1) task is attached
> to the 16th sub-cgroup of cgroup 5 until the 60-second run is over. After the run, the idle%
> of a CPU is calculated by summing the idle column from the vmstat log and dividing it
> by the number of samples collected, of course after discarding the first record
> from the log.
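>
> (Roughly, the computation mirrors the awk in the attached script; column 15
> of vmstat's output is the idle column:)
>
> vmstat 2 100 &> vmstat_log &
> # after the run:
> grep -iv "system" vmstat_log | grep -iv "swpd" | awk '{ if (NR != 1) id += $15 } END { print id/NR }'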
>
> How are the tasks pinned to the CPUs?
> -------------------------------------
> - The cgroup filesystem is mounted with the cpuset and cpu controllers, and one
> physical CPU is allocated for every 2 sub-cgroups. i.e. CPU 1 is shared between 1/1 and 1/2 (Group 1,
> sub-cgroup 1 and sub-cgroup 2). Similarly, CPUs 7 to 15 are allocated to 5/1 to
> 5/16 (Group 5, sub-cgroup 1 to 16). Note that the test machine used has
> 16 CPUs.
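>
> (For example, pinning Group 1's two sub-cgroups to one CPU amounts to the
> following; the attached pin_tasks() walks the whole hierarchy this way:)
>
> echo 0 > /cgroup/1/1/cpuset.cpus   # sub-cgroup 1/1 on CPU 0
> echo 0 > /cgroup/1/2/cpuset.cpus   # sub-cgroup 1/2 on the same CPU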
>
> Result for the non-pinned case
> ------------------------------
> Only the hierarchy is created, as stated above; cpusets are not assigned per cgroup.
>
> Average CPU Idle percentage 34.8% (measured as explained above)
> Bandwidth shared with remaining non-Idle 65.2%
>
> * Note: For the sake of round-off, the values are multiplied by 100.
>
> In the results below, for cgroup 1, 9.2500 corresponds to the sum-exec time captured
> from /proc/sched_debug for cgroup 1's tasks (including sub-cgroups 1 and 2),
> which in turn is ~6% of the non-Idle CPU time (derived as 9.2500 * 65.2 / 100).
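>
> (That is the same derivation print_results() does with bc:)
>
> echo "scale=4; (9.2500 * 65.2) / 100" | bc   # => 6.0310, i.e. ~6.03%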
>
> Bandwidth of Group 1 = 9.2500 i.e = 6.0300% of non-Idle CPU time 65.2%
> |...... subgroup 1/1 = 48.7800 i.e = 2.9400% of 6.0300% Groups non-Idle CPU time
> |...... subgroup 1/2 = 51.2100 i.e = 3.0800% of 6.0300% Groups non-Idle CPU time
>
>
> Bandwidth of Group 2 = 9.0400 i.e = 5.8900% of non-Idle CPU time 65.2%
> |...... subgroup 2/1 = 51.0200 i.e = 3.0000% of 5.8900% Groups non-Idle CPU time
> |...... subgroup 2/2 = 48.9700 i.e = 2.8800% of 5.8900% Groups non-Idle CPU time
>
>
> Bandwidth of Group 3 = 16.9300 i.e = 11.0300% of non-Idle CPU time 65.2%
> |...... subgroup 3/1 = 26.0300 i.e = 2.8700% of 11.0300% Groups non-Idle CPU time
> |...... subgroup 3/2 = 25.8800 i.e = 2.8500% of 11.0300% Groups non-Idle CPU time
> |...... subgroup 3/3 = 22.7800 i.e = 2.5100% of 11.0300% Groups non-Idle CPU time
> |...... subgroup 3/4 = 25.2900 i.e = 2.7800% of 11.0300% Groups non-Idle CPU time
>
>
> Bandwidth of Group 4 = 27.9300 i.e = 18.2100% of non-Idle CPU time 65.2%
> |...... subgroup 4/1 = 16.6000 i.e = 3.0200% of 18.2100% Groups non-Idle CPU time
> |...... subgroup 4/2 = 8.0000 i.e = 1.4500% of 18.2100% Groups non-Idle CPU time
> |...... subgroup 4/3 = 9.0000 i.e = 1.6300% of 18.2100% Groups non-Idle CPU time
> |...... subgroup 4/4 = 7.9600 i.e = 1.4400% of 18.2100% Groups non-Idle CPU time
> |...... subgroup 4/5 = 12.3500 i.e = 2.2400% of 18.2100% Groups non-Idle CPU time
> |...... subgroup 4/6 = 16.2500 i.e = 2.9500% of 18.2100% Groups non-Idle CPU time
> |...... subgroup 4/7 = 12.6100 i.e = 2.2900% of 18.2100% Groups non-Idle CPU time
> |...... subgroup 4/8 = 17.1900 i.e = 3.1300% of 18.2100% Groups non-Idle CPU time
>
>
> Bandwidth of Group 5 = 36.8300 i.e = 24.0100% of non-Idle CPU time 65.2%
> |...... subgroup 5/1 = 56.6900 i.e = 13.6100% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/2 = 8.8600 i.e = 2.1200% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/3 = 5.5100 i.e = 1.3200% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/4 = 4.5700 i.e = 1.0900% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/5 = 7.9500 i.e = 1.9000% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/6 = 2.1600 i.e = .5100% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/7 = 2.3400 i.e = .5600% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/8 = 2.1500 i.e = .5100% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/9 = 9.7200 i.e = 2.3300% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/10 = 5.0600 i.e = 1.2100% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/11 = 4.6900 i.e = 1.1200% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/12 = 8.9700 i.e = 2.1500% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/13 = 8.4600 i.e = 2.0300% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/14 = 11.8400 i.e = 2.8400% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/15 = 6.3400 i.e = 1.5200% of 24.0100% Groups non-Idle CPU time
> |...... subgroup 5/16 = 5.1500 i.e = 1.2300% of 24.0100% Groups non-Idle CPU time
>
> Pinned case
> --------------
> The cgroup hierarchy is created and cpusets are allocated as described above.
>
> Average CPU Idle percentage 0%
> Bandwidth shared with remaining non-Idle 100%
>
> Bandwidth of Group 1 = 6.3400 i.e = 6.3400% of non-Idle CPU time 100%
> |...... subgroup 1/1 = 50.0400 i.e = 3.1700% of 6.3400% Groups non-Idle CPU time
> |...... subgroup 1/2 = 49.9500 i.e = 3.1600% of 6.3400% Groups non-Idle CPU time
>
>
> Bandwidth of Group 2 = 6.3200 i.e = 6.3200% of non-Idle CPU time 100%
> |...... subgroup 2/1 = 50.0400 i.e = 3.1600% of 6.3200% Groups non-Idle CPU time
> |...... subgroup 2/2 = 49.9500 i.e = 3.1500% of 6.3200% Groups non-Idle CPU time
>
>
> Bandwidth of Group 3 = 12.6300 i.e = 12.6300% of non-Idle CPU time 100%
> |...... subgroup 3/1 = 25.0300 i.e = 3.1600% of 12.6300% Groups non-Idle CPU time
> |...... subgroup 3/2 = 25.0100 i.e = 3.1500% of 12.6300% Groups non-Idle CPU time
> |...... subgroup 3/3 = 25.0000 i.e = 3.1500% of 12.6300% Groups non-Idle CPU time
> |...... subgroup 3/4 = 24.9400 i.e = 3.1400% of 12.6300% Groups non-Idle CPU time
>
>
> Bandwidth of Group 4 = 25.1000 i.e = 25.1000% of non-Idle CPU time 100%
> |...... subgroup 4/1 = 12.5400 i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
> |...... subgroup 4/2 = 12.5100 i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
> |...... subgroup 4/3 = 12.5300 i.e = 3.1400% of 25.1000% Groups non-Idle CPU time
> |...... subgroup 4/4 = 12.5000 i.e = 3.1300% of 25.1000% Groups non-Idle CPU time
> |...... subgroup 4/5 = 12.4900 i.e = 3.1300% of 25.1000% Groups non-Idle CPU time
> |...... subgroup 4/6 = 12.4700 i.e = 3.1200% of 25.1000% Groups non-Idle CPU time
> |...... subgroup 4/7 = 12.4700 i.e = 3.1200% of 25.1000% Groups non-Idle CPU time
> |...... subgroup 4/8 = 12.4500 i.e = 3.1200% of 25.1000% Groups non-Idle CPU time
>
>
> Bandwidth of Group 5 = 49.5700 i.e = 49.5700% of non-Idle CPU time 100%
> |...... subgroup 5/1 = 49.8500 i.e = 24.7100% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/2 = 6.2900 i.e = 3.1100% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/3 = 6.2800 i.e = 3.1100% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/4 = 6.2700 i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/5 = 6.2700 i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/6 = 6.2600 i.e = 3.1000% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/7 = 6.2500 i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/8 = 6.2400 i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/9 = 6.2400 i.e = 3.0900% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/10 = 6.2300 i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/11 = 6.2300 i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/12 = 6.2200 i.e = 3.0800% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/13 = 6.2100 i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/14 = 6.2100 i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/15 = 6.2100 i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
> |...... subgroup 5/16 = 6.2100 i.e = 3.0700% of 49.5700% Groups non-Idle CPU time
>
> With equal cpu.shares allocated to all the groups/sub-cgroups and CFS bandwidth configured
> to allow 100% CPU utilization, we see CPU idle time only in the un-pinned case.
>
> The benchmark used to reproduce the issue is attached; just executing the script should
> report similar numbers.
>
> #!/bin/bash
>
> NR_TASKS1=2
> NR_TASKS2=2
> NR_TASKS3=4
> NR_TASKS4=8
> NR_TASKS5=16
>
> BANDWIDTH=1
> SUBGROUP=1
> PRO_SHARES=0
> MOUNT=/cgroup/
> LOAD=/root/while1
>
> usage()
> {
> echo "Usage $0: [-b 0|1] [-s 0|1] [-p 0|1]"
> echo "-b 1|0 set/unset Cgroups bandwidth control (default set)"
> echo "-s Create sub-groups for every task (default creates sub-group)"
> echo "-p create propotional shares based on cpus"
> exit
> }
> while getopts ":b:s:p:" arg
> do
> case $arg in
> b)
> BANDWIDTH=$OPTARG
> # no shift needed inside getopts; OPTIND tracks the position
> if [ $BANDWIDTH -gt 1 ] || [ $BANDWIDTH -lt 0 ]
> then
> usage
> fi
> ;;
> s)
> SUBGROUP=$OPTARG
> if [ $SUBGROUP -gt 1 ] || [ $SUBGROUP -lt 0 ]
> then
> usage
> fi
> ;;
> p)
> PRO_SHARES=$OPTARG
> if [ $PRO_SHARES -gt 1 ] || [ $PRO_SHARES -lt 0 ]
> then
> usage
> fi
> ;;
>
> *)
>
> esac
> done
> if [ ! -d $MOUNT ]
> then
> mkdir -p $MOUNT
> fi
> test()
> {
> echo -n "[ "
> if [ $1 -eq 0 ]
> then
> echo -ne '\E[42;40mOk'
> else
> echo -ne '\E[31;40mFailed'
> tput sgr0
> echo " ]"
> exit
> fi
> tput sgr0
> echo " ]"
> }
> mount_cgrp()
> {
> echo -n "Mounting root cgroup "
> mount -t cgroup -ocpu,cpuset,cpuacct none $MOUNT &> /dev/null
> test $?
> }
>
> umount_cgrp()
> {
> echo -n "Unmounting root cgroup "
> cd /root/
> umount $MOUNT
> test $?
> }
>
> create_hierarchy()
> {
> mount_cgrp
> cpuset_mem=`cat $MOUNT/cpuset.mems`
> cpuset_cpu=`cat $MOUNT/cpuset.cpus`
> echo -n "creating groups/sub-groups ..."
> for (( i=1; i<=5; i++ ))
> do
> mkdir $MOUNT/$i
> echo $cpuset_mem > $MOUNT/$i/cpuset.mems
> echo $cpuset_cpu > $MOUNT/$i/cpuset.cpus
> echo -n ".."
> if [ $SUBGROUP -eq 1 ]
> then
> jj=$(eval echo "\$NR_TASKS$i")
> for (( j=1; j<=$jj; j++ ))
> do
> mkdir -p $MOUNT/$i/$j
> echo $cpuset_mem > $MOUNT/$i/$j/cpuset.mems
> echo $cpuset_cpu > $MOUNT/$i/$j/cpuset.cpus
> echo -n ".."
> done
> fi
> done
> echo "."
> }
>
> cleanup()
> {
> pkill -9 while1 &> /dev/null
> sleep 10
> echo -n "Umount groups/sub-groups .."
> for (( i=1; i<=5; i++ ))
> do
> if [ $SUBGROUP -eq 1 ]
> then
> jj=$(eval echo "\$NR_TASKS$i")
> for (( j=1; j<=$jj; j++ ))
> do
> rmdir $MOUNT/$i/$j
> echo -n ".."
> done
> fi
> rmdir $MOUNT/$i
> echo -n ".."
> done
> echo " "
> umount_cgrp
> }
>
> load_tasks()
> {
> for (( i=1; i<=5; i++ ))
> do
> jj=$(eval echo "\$NR_TASKS$i")
> shares="1024"
> if [ $PRO_SHARES -eq 1 ]
> then
> eval shares=$(echo "$jj * 1024" | bc)
> fi
> echo $shares > $MOUNT/$i/cpu.shares
> for (( j=1; j<=$jj; j++ ))
> do
> echo "-1" > $MOUNT/$i/cpu.cfs_quota_us
> echo "500000" > $MOUNT/$i/cpu.cfs_period_us
> if [ $SUBGROUP -eq 1 ]
> then
>
> $LOAD &
> echo $! > $MOUNT/$i/$j/tasks
> echo "1024" > $MOUNT/$i/$j/cpu.shares
>
> if [ $BANDWIDTH -eq 1 ]
> then
> echo "500000" > $MOUNT/$i/$j/cpu.cfs_period_us
> echo "250000" > $MOUNT/$i/$j/cpu.cfs_quota_us
> fi
> else
> $LOAD &
> echo $! > $MOUNT/$i/tasks
> echo $shares > $MOUNT/$i/cpu.shares
>
> if [ $BANDWIDTH -eq 1 ]
> then
> echo "500000" > $MOUNT/$i/cpu.cfs_period_us
> echo "250000" > $MOUNT/$i/cpu.cfs_quota_us
> fi
> fi
> done
> done
> echo "Captuing idle cpu time with vmstat...."
> vmstat 2 100 &> vmstat_log &
> }
>
> pin_tasks()
> {
> cpu=0
> count=1
> for (( i=1; i<=5; i++ ))
> do
> if [ $SUBGROUP -eq 1 ]
> then
> jj=$(eval echo "\$NR_TASKS$i")
> for (( j=1; j<=$jj; j++ ))
> do
> if [ $count -gt 2 ]
> then
> cpu=$((cpu+1))
> count=1
> fi
> echo $cpu > $MOUNT/$i/$j/cpuset.cpus
> count=$((count+1))
> done
> else
> case $i in
> 1)
> echo 0 > $MOUNT/$i/cpuset.cpus;;
> 2)
> echo 1 > $MOUNT/$i/cpuset.cpus;;
> 3)
> echo "2-3" > $MOUNT/$i/cpuset.cpus;;
> 4)
> echo "4-6" > $MOUNT/$i/cpuset.cpus;;
> 5)
> echo "7-15" > $MOUNT/$i/cpuset.cpus;;
> esac
> fi
> done
>
> }
>
> print_results()
> {
> eval gtot=$(cat sched_log|grep -i while|sed 's/R//g'|awk '{gtot+=$7};END{printf "%f", gtot}')
> for (( i=1; i<=5; i++ ))
> do
> eval temp=$(cat sched_log_$i|sed 's/R//g'| awk '{gtot+=$7};END{printf "%f",gtot}')
> eval tavg=$(echo "scale=4;(($temp / $gtot) * $1)/100 " | bc)
> eval avg=$(echo "scale=4;($temp / $gtot) * 100" | bc)
> eval pretty_tavg=$(echo "scale=4; $tavg * 100" | bc) # For pretty format
> echo "Bandwidth of Group $i = $avg i.e = $pretty_tavg% of non-Idle CPU time $1%"
> if [ $SUBGROUP -eq 1 ]
> then
> jj=$(eval echo "\$NR_TASKS$i")
> for (( j=1; j<=$jj; j++ ))
> do
> eval tmp=$(cat sched_log_$i-$j|sed 's/R//g'| awk '{gtot+=$7};END{printf "%f",gtot}')
> eval stavg=$(echo "scale=4;($tmp / $temp) * 100" | bc)
> eval pretty_stavg=$(echo "scale=4;(($tmp / $temp) * $tavg) * 100" | bc)
> echo -n "|"
> echo -e "...... subgroup $i/$j\t= $stavg\ti.e = $pretty_stavg% of $pretty_tavg% Groups non-Idle CPU time"
> done
> fi
> echo " "
> echo " "
> done
> }
> capture_results()
> {
> cat /proc/sched_debug > sched_log
> pkill -9 vmstat
> avg=$(cat vmstat_log |grep -iv "system"|grep -iv "swpd"|awk ' { if ( NR != 1) {id+=$15 }}END{print (id/NR)}')
>
> rem=$(echo "scale=2; 100 - $avg" |bc)
> echo "Average CPU Idle percentage $avg%"
> echo "Bandwidth shared with remaining non-Idle $rem%"
> for (( i=1; i<=5; i++ ))
> do
> cat sched_log |grep -i while1|grep -i " \/$i" > sched_log_$i
> if [ $SUBGROUP -eq 1 ]
> then
> jj=$(eval echo "\$NR_TASKS$i")
> for (( j=1; j<=$jj; j++ ))
> do
> cat sched_log |grep -i while1|grep -i " \/$i\/$j" > sched_log_$i-$j
> done
> fi
> done
> print_results $rem
> }
> create_hierarchy
> pin_tasks
>
> load_tasks
> sleep 60
> capture_results
> cleanup
> exit
>
> Thanks,
> Kamalesh.
>