Re: [Question] sched: the load is unbalanced in the VM overcommitment scenario
From: Vincent Guittot
Date: Tue Sep 17 2024 - 02:19:27 EST
On Sat, 14 Sept 2024 at 09:04, zhengzucheng <zhengzucheng@xxxxxxxxxx> wrote:
>
>
> On 2024/9/13 23:55, Vincent Guittot wrote:
> > On Fri, 13 Sept 2024 at 06:03, zhengzucheng <zhengzucheng@xxxxxxxxxx> wrote:
> >> In a VM overcommitment scenario with a 1:2 overcommit ratio, 8 CPUs
> >> are overcommitted to 2 x 8-vCPU VMs, i.e. 16 vCPUs are bound to 8
> >> CPUs. However, one VM obtains only 2 CPUs' worth of resources while
> >> the other VM gets 6 CPUs.
> >> The host has 80 CPUs in one sched domain and the other CPUs are idle.
> >> The root cause is that the load on the host is unbalanced: some vCPUs
> >> occupy CPU resources exclusively.
> >> When the CPU that triggers load balance calculates the imbalance
> >> value, env->imbalance = 0 is computed because
> >> local->avg_load > sds->avg_load. As a result, the load balance fails.
> >> The processing logic:
> >> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
> >>
> >>
> >> This is expected behaviour for the kernel load balancer, but it is
> >> not reasonable from the perspective of VM users.
> >> In cgroup v1, setting cpuset.sched_load_balance=0 to modify the sched
> >> domains fixes it.
> >> Is there any other method to fix this problem? Thanks.
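To make the avg_load comparison described here concrete: below is a
tiny standalone program, not kernel code, that redoes the arithmetic
with assumed round numbers for this topology (an SMT pair as the local
group, 80 CPUs in domain1, the 8 pinned CPUs fully busy). The 72 idle
CPUs dilute sds->avg_load, so the busy local group always looks more
loaded than the system average and declines to pull:

#include <stdio.h>

#define SCALE 1024UL  /* stand-in for SCHED_CAPACITY_SCALE */

int main(void)
{
        /* Local group: one SMT pair (e.g. CPUs 0,80), each CPU running
         * one always-running task, so load ~= capacity. */
        unsigned long local_load = 2 * SCALE;   /* group_load */
        unsigned long local_cap  = 2 * SCALE;   /* group_capacity */

        /* domain1: 80 CPUs in total, only the 8 pinned CPUs are busy. */
        unsigned long total_load = 8 * SCALE;   /* sds->total_load */
        unsigned long total_cap  = 80 * SCALE;  /* sds->total_capacity */

        unsigned long local_avg = local_load * SCALE / local_cap;  /* 1024 */
        unsigned long sds_avg   = total_load * SCALE / total_cap;  /* 102 */

        printf("local->avg_load=%lu sds->avg_load=%lu\n", local_avg, sds_avg);

        /* The condition from the report: the local group looks more
         * loaded than the domain average, so env->imbalance ends up 0. */
        if (local_avg >= sds_avg)
                printf("imbalance = 0 -> no pull, even though other pinned "
                       "CPUs have several runnable tasks each\n");

        return 0;
}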
> > I'm not sure I understand your setup or why load balance is not
> > correctly balancing the 16 vCPUs between the 8 CPUs.
> >
> > From your test case description below, you have 8 always-running
> > threads in cgroup A and 8 always-running threads in cgroup B, and the 2
> > cgroups have only 8 CPUs out of the 80. This should not be a problem for
> > load balance. I tried something similar, although not exactly the same,
> > with cgroup v2 and rt-app, and I don't see a noticeable imbalance.
> >
> > Do you have more details that you can share about your system ?
> >
> > Which kernel version are you using ? Which arch ?
>
> kernel version: 6.11.0-rc7
> arch: X86_64 and cgroup v1
okay
>
> >> Abstracted reproduction case:
> >> 1.environment information:
> >>
> >> [root@localhost ~]# cat /proc/schedstat
> >>
> >> cpu0
> >> domain0 00000000,00000000,00010000,00000000,00000001
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu1
> >> domain0 00000000,00000000,00020000,00000000,00000002
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu2
> >> domain0 00000000,00000000,00040000,00000000,00000004
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu3
> >> domain0 00000000,00000000,00080000,00000000,00000008
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> > Is it correct to assume that domain0 is SMT, domain1 MC and domain2 PKG?
> > And that cpu80-83 are in the other group of PKG, and LLC is at domain1 level?
>
> domain0 is SMT and domain1 is MC
> thread_siblings_list:0,80. 1,81. 2,82. 3,83
Yeah, I should have read the domain0 cpumask more carefully.
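For reference, those masks are comma-separated 32-bit hex words with
the most significant word first. A quick throwaway decoder (not from
the kernel tree), using the cpu0 domain0 mask quoted above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        const char *mask = "00000000,00000000,00010000,00000000,00000001";
        char buf[128];
        char *words[32];
        int nr = 0;

        strncpy(buf, mask, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';

        for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ","))
                words[nr++] = tok;

        /* words[0] is the most significant word, so word i covers
         * CPUs (nr - 1 - i) * 32 .. (nr - i) * 32 - 1. */
        for (int i = 0; i < nr; i++) {
                unsigned long w = strtoul(words[i], NULL, 16);
                int base = (nr - 1 - i) * 32;

                for (int bit = 0; bit < 32; bit++)
                        if (w & (1UL << bit))
                                printf("CPU %d\n", base + bit);
        }
        return 0;
}

It prints CPU 80 and CPU 0, i.e. exactly the 0,80 sibling pair.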
> LLC is at domain1 level
>
> >> 2.test case:
> >>
> >> vcpu.c
> >> #include <stdio.h>
> >> #include <unistd.h>
> >>
> >> int main(void)
> >> {
> >>         /* Give test.sh time to start all instances and move them
> >>          * into the cgroup before they start spinning. */
> >>         sleep(20);
> >>         /* Emulate a busy vCPU: spin forever. */
> >>         while (1)
> >>                 ;
> >>         return 0;
> >> }
> >>
> >> gcc vcpu.c -o vcpu
> >> -----------------------------------------------------------------
> >> test.sh
> >>
> >> #!/bin/bash
> >>
> >> #vcpu1
> >> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> >> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> >> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> >> for i in {1..8}
> >> do
> >>         ./vcpu &
> >>         pid=$!
> >>         sleep 1
> >>         echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> >> done
> >>
> >> #vcpu2
> >> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> >> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> >> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> >> for i in {1..8}
> >> do
> >>         ./vcpu &
> >>         pid=$!
> >>         sleep 1
> >>         echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> >> done
> >> ------------------------------------------------------------------
> >> [root@localhost ~]# ./test.sh
> >>
> >> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
> >>
> >> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
> >> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
> >> 14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72 ./vcpu
> >> 14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72 ./vcpu
> >> 14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72 ./vcpu
> >> 14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72 ./vcpu
> >> 14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu
> >> 14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu
> >> 14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu
> >> 14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu
> >> 14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu
> >> 14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu
> >> 14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu
> >> 14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu
> >> 14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu
> >> 14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu
> >>
So I finally understood your situation. The limited cpuset screws up
the avg load of the system for domain1. The group_imbalanced state is
there to try to fix imbalanced situations caused by tasks that are
pinned to a subset of the CPUs of the sched domain, but it can't cover
all cases.
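Roughly, the mechanism works like this (a simplified model, not the
actual fair.c code): when load_balance() cannot move anything because
the candidate tasks are pinned away from the destination CPU, it flags
the corresponding group in the parent domain (sgc->imbalance), and a
later pass classifies that group as group_imbalanced, which bypasses
the avg_load comparison and forces a one-task migration:

#include <stdbool.h>
#include <stdio.h>

struct group_stats {
        unsigned long avg_load;
        bool pinned_failures;   /* roughly sgc->imbalance in the kernel */
};

/* Simplified stand-in for calculate_imbalance(). */
static unsigned long imbalance(struct group_stats *local,
                               struct group_stats *busiest,
                               unsigned long sds_avg_load)
{
        /* group_imbalanced path: force a one-task migration. */
        if (busiest->pinned_failures)
                return 1;

        /* Otherwise the avg_load check from the report applies. */
        if (local->avg_load >= sds_avg_load)
                return 0;

        return busiest->avg_load - sds_avg_load;        /* simplified */
}

int main(void)
{
        struct group_stats local = { .avg_load = 1024 };
        struct group_stats busiest = { .avg_load = 2048 };

        /* Without the flag, the diluted domain average blocks the pull. */
        printf("imbalance = %lu\n", imbalance(&local, &busiest, 102));

        /* With it, one task would still be moved. */
        busiest.pinned_failures = true;
        printf("imbalance = %lu\n", imbalance(&local, &busiest, 102));
        return 0;
}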