Re: [Question] sched: the load is unbalanced in the VM overcommitment scenario

From: Vincent Guittot
Date: Fri Sep 13 2024 - 11:55:46 EST


On Fri, 13 Sept 2024 at 06:03, zhengzucheng <zhengzucheng@xxxxxxxxxx> wrote:
>
> In a VM overcommitment scenario with a 1:2 overcommit ratio, 8 CPUs
> are shared by two 8-vCPU VMs, i.e. 16 vCPUs are bound to 8 host CPUs.
> However, one VM gets only 2 CPUs' worth of resources while the other
> gets 6 CPUs.
> The host has 80 CPUs in one sched domain; the other CPUs are idle.
> The root cause is that the host load is unbalanced: some vCPUs occupy
> CPU resources exclusively. When the CPU that triggers load balancing
> calculates the imbalance value, it computes env->imbalance = 0 because
> local->avg_load > sds->avg_load, so the load balance fails.
> The relevant logic:
> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
>
>
> This is expected behavior for the kernel load balancer, but it is not
> reasonable from the perspective of the VM users.
> In cgroup v1, setting cpuset.sched_load_balance=0 to split the sched
> domain works around it.
> Is there any other way to fix this problem? Thanks.

I'm not sure I understand your setup and why load balancing does not
spread the 16 vCPUs correctly across the 8 CPUs.

From your test case description below, you have 8 always-running
threads in cgroup A and 8 always-running threads in cgroup B, and the
2 cgroups share only 8 CPUs out of 80. This should not be a problem
for load balancing. I tried something similar, although not exactly
the same, with cgroup v2 and rt-app, and I don't see a noticeable
imbalance.
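
For comparison, the setup was along the following lines. This is a
sketch under cgroup v2 (unified hierarchy mounted at /sys/fs/cgroup),
reusing your busy-loop binary instead of the rt-app configuration I
actually ran; adjust the CPU list to your topology:

#!/bin/bash
# Two cpuset cgroups sharing the same 8 CPUs under cgroup v2.

# Make the cpuset controller available to child cgroups.
echo '+cpuset' > /sys/fs/cgroup/cgroup.subtree_control

for grp in vcpu_1 vcpu_2; do
        mkdir /sys/fs/cgroup/$grp
        echo '0-3,80-83' > /sys/fs/cgroup/$grp/cpuset.cpus
        for i in {1..8}; do
                ./vcpu &
                echo $! > /sys/fs/cgroup/$grp/cgroup.procs
        done
done

As an aside, the cgroup v2 counterpart of the cpuset.sched_load_balance=0
workaround you mention is the partition interface: writing "root" to a
cpuset's cpuset.cpus.partition carves its CPUs out into a separate sched
domain.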

Do you have more details that you can share about your system?

Which kernel version are you using? Which arch?

>
> Abstracted reproduction case:
> 1. Environment information:
>
> [root@localhost ~]# cat /proc/schedstat
>
> cpu0
> domain0 00000000,00000000,00010000,00000000,00000001
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu1
> domain0 00000000,00000000,00020000,00000000,00000002
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu2
> domain0 00000000,00000000,00040000,00000000,00000004
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu3
> domain0 00000000,00000000,00080000,00000000,00000008
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff

Is it correct to assume that domain0 is SMT, domain1 is MC, and domain2
is PKG? And that CPUs 80-83 are in the other group of PKG, and that LLC
is at the domain1 level?
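
One quick way to confirm, assuming SCHED_DEBUG is enabled (on kernels
before the sched debugfs move, the same files live under
/proc/sys/kernel/sched_domain/ instead):

cat /sys/kernel/debug/sched/domains/cpu0/domain*/name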

>
> 2. Test case:
>
> vcpu.c
>
> /*
>  * Simple CPU hog: sleep long enough for all instances to be started
>  * and moved into their cpuset, then spin forever.
>  */
> #include <unistd.h>
>
> int main(void)
> {
>         sleep(20);
>         while (1)
>                 ;
>         return 0;
> }
>
> gcc vcpu.c -o vcpu
> -----------------------------------------------------------------
> test.sh
>
> #!/bin/bash
>
> # vcpu_1: 8 busy-loop tasks restricted to CPUs 0-3,80-83
> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> echo '0-3,80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> for i in {1..8}
> do
>         ./vcpu &
>         pid=$!
>         sleep 1
>         echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> done
>
> # vcpu_2: 8 more busy-loop tasks on the same CPUs
> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> echo '0-3,80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> for i in {1..8}
> do
>         ./vcpu &
>         pid=$!
>         sleep 1
>         echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> done
> ------------------------------------------------------------------
> [root@localhost ~]# ./test.sh
>
> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>
> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
> 14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72 ./vcpu
> 14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72 ./vcpu
> 14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72 ./vcpu
> 14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72 ./vcpu
> 14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu
> 14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu
> 14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu
> 14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu
> 14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu
> 14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu
> 14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu
> 14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu
> 14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu
> 14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu
>