Re: [Question] sched:the load is unbalanced in the VM overcommitment scenario

From: Waiman Long
Date: Fri Sep 13 2024 - 13:17:34 EST


On 9/13/24 00:03, zhengzucheng wrote:
In the VM overcommitment scenario, the overcommitment ratio is 1:2: 8 CPUs are overcommitted to 2 x 8u VMs,
and 16 vCPUs are bound to 8 CPUs. However, one VM obtains only 2 CPUs' worth of resources, while the other VM gets 6 CPUs.
The host is configured with 80 CPUs in a sched domain, and the other CPUs are idle.
The root cause is that the load of the host is unbalanced: some vCPUs exclusively occupy CPU resources.
When the CPU that triggers load balance calculates the imbalance value, env->imbalance = 0 is calculated because
local->avg_load > sds->avg_load. As a result, the load balance fails.
The processing logic: https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70


This is normal from the kernel load balancer's point of view, but it is not reasonable from the perspective of VM users.
In cgroup v1, setting cpuset.sched_load_balance=0 to modify the sched domain fixes it.
Is there any other method to fix this problem? Thanks.

Abstracted reproduction case:
1. Environment information:

[root@localhost ~]# cat /proc/schedstat

cpu0
domain0 00000000,00000000,00010000,00000000,00000001
domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
cpu1
domain0 00000000,00000000,00020000,00000000,00000002
domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
cpu2
domain0 00000000,00000000,00040000,00000000,00000004
domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
cpu3
domain0 00000000,00000000,00080000,00000000,00000008
domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff

2. Test case:

vcpu.c
#include <stdio.h>
#include <unistd.h>

int main()
{
        sleep(20);
        while (1);
        return 0;
}

gcc vcpu.c -o vcpu
-----------------------------------------------------------------
test.sh

#!/bin/bash

#vcpu1
mkdir /sys/fs/cgroup/cpuset/vcpu_1
echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
for i in {1..8}
do
        ./vcpu &
        pid=$!
        sleep 1
        echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
done

#vcpu2
mkdir /sys/fs/cgroup/cpuset/vcpu_2
echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
for i in {1..8}
do
        ./vcpu &
        pid=$!
        sleep 1
        echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
done
------------------------------------------------------------------
[root@localhost ~]# ./test.sh

[root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)

14591 root      20   0    2448   1012    928 R 100.0   0.0 13:10.73 ./vcpu
14582 root      20   0    2448   1012    928 R 100.0   0.0 13:12.71 ./vcpu
14606 root      20   0    2448    872    784 R 100.0   0.0 13:09.72 ./vcpu
14620 root      20   0    2448    916    832 R 100.0   0.0 13:07.72 ./vcpu
14622 root      20   0    2448    920    836 R 100.0   0.0 13:06.72 ./vcpu
14629 root      20   0    2448    920    832 R 100.0   0.0 13:05.72 ./vcpu
14643 root      20   0    2448    924    836 R  21.0   0.0 2:37.13 ./vcpu
14645 root      20   0    2448    868    784 R  21.0   0.0 2:36.51 ./vcpu
14589 root      20   0    2448    900    816 R  20.0   0.0 2:45.16 ./vcpu
14608 root      20   0    2448    956    872 R  20.0   0.0 2:42.24 ./vcpu
14632 root      20   0    2448    872    788 R  20.0   0.0 2:38.08 ./vcpu
14638 root      20   0    2448    924    840 R  20.0   0.0 2:37.48 ./vcpu
14652 root      20   0    2448    928    844 R  20.0   0.0 2:36.42 ./vcpu
14654 root      20   0    2448    924    840 R  20.0   0.0 2:36.14 ./vcpu
14663 root      20   0    2448    900    816 R  20.0   0.0 2:35.38 ./vcpu
14669 root      20   0    2448    868    784 R  20.0   0.0 2:35.70 ./vcpu

Your script creates two cpusets with the same set of CPUs. The scheduling of the tasks, however, is not controlled by cpuset. It is controlled by the cpu cgroup. I suppose that all these tasks are in the same cpu cgroup. It is possible that the commit you mentioned has caused some unfairness in allocating CPU time to different processes within the same cpu cgroup. Maybe you can try putting them into separate cpu cgroups with equal weight as well, to see if that improves the scheduling fairness?
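Something along these lines could be used to try that out. It is only a rough sketch: it assumes cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu, and the helper names (make_cpu_group, add_task), the CPU_CG_ROOT variable and the value 1024 are made up for illustration, not taken from your setup.

```shell
#!/bin/bash
# Sketch only. Assumption: cgroup v1, cpu controller mounted at
# /sys/fs/cgroup/cpu. Override CPU_CG_ROOT if it is mounted elsewhere.
CPU_CG_ROOT="${CPU_CG_ROOT:-/sys/fs/cgroup/cpu}"

# Create a cpu cgroup ($1) and set its relative weight ($2) via cpu.shares.
make_cpu_group() {
        mkdir -p "$CPU_CG_ROOT/$1"
        echo "$2" > "$CPU_CG_ROOT/$1/cpu.shares"
}

# Move one PID ($2) into the cpu cgroup ($1).
add_task() {
        echo "$2" > "$CPU_CG_ROOT/$1/tasks"
}

# Hypothetical usage inside test.sh (needs root):
#   make_cpu_group vcpu_1 1024
#   make_cpu_group vcpu_2 1024
#   ./vcpu & add_task vcpu_1 $!
```

With the same cpu.shares value on both groups, each group as a whole should be entitled to the same share of CPU time when they compete for the same CPUs. Whether that is enough to fix the imbalance you see would still have to be measured.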

BTW, you don't actually need to use 2 different cpusets if they all get the same set of CPUs and memory nodes. Also, setting cpuset.sched_load_balance=0 may not actually get you what you want unless all the cpusets that use those CPUs have cpuset.sched_load_balance set to 0, including the root cgroup. Turning off this flag may disable load balancing, but it may not guarantee fairness, depending on which CPUs those tasks are running on when they start, unless you explicitly assign the CPUs to them when starting these tasks.
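As a rough illustration of assigning CPUs explicitly at start time, the tasks could be pinned round-robin with taskset. This is a sketch under assumptions: the CPU list is copied from your test.sh, the helper name cpu_for_task is made up, and one task per VM per CPU is just one possible placement.

```shell
#!/bin/bash
# Sketch only. CPU list taken from the cpuset.cpus value in test.sh.
CPUS="0 1 2 3 80 81 82 83"

# Print the CPU for the task with index $1, wrapping around the list.
cpu_for_task() {
        idx="$1"
        set -- $CPUS            # intentional word splitting into $1..$8
        shift $(( idx % $# ))
        echo "$1"
}

# Hypothetical usage: start the 8 tasks of one VM, one per CPU.
#   for i in 0 1 2 3 4 5 6 7
#   do
#           taskset -c "$(cpu_for_task "$i")" ./vcpu &
#   done
```

Running the same loop for the second VM would put one task from each VM on every CPU, so with load balancing disabled each CPU would be shared by exactly two tasks.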

Cheers,
Longman