Group Imbalance Bug - performance drop by factor 10x on NUMA boxes with cgroups

From: Jirka Hladky
Date: Sat Oct 27 2018 - 19:25:38 EST


Hi Mel and Srikar,

I would like to ask you if you could look into the Group Imbalance Bug
described in this paper

http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

in chapter 3.1. See also comment [1]. The paper describes the bug using a
workload that involves different ssh sessions, and it assumes
kernel.sched_autogroup_enabled=1. We have found that it can be
reproduced more easily with cgroups.

The reproducer consists of this workload:
* 2 separate "stress --cpu 1" processes. Each stress process needs 1 CPU.
* NAS benchmark (https://www.nas.nasa.gov/publications/npb.html), from
which I use the lu.C.x binary (Lower-Upper Gauss-Seidel solver) in the
Open Multi-Processing (OMP) mode.

We run the workload in two modes:

NORMAL - both stress and lu.C.x are run in the same control group
GROUP - each binary is run in a separate control group:
stress, first instance: cpu:test_group_1
stress, second instance: cpu:test_group_2
lu.C.x : cpu:test_group_main
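
A minimal sketch of how the two modes could be launched. It only builds
the command lines and does not run anything. It assumes the libcgroup
tools (cgcreate, cgexec) are installed and that lu.C.x sits in the
current directory; the actual reproducer may set things up differently,
and build_commands is a hypothetical helper name.

```python
# Control groups as listed above, in libcgroup "controller:path" notation.
GROUPS = {
    "stress_1": "cpu:test_group_1",
    "stress_2": "cpu:test_group_2",
    "lu": "cpu:test_group_main",
}


def build_commands(mode, threads):
    """Return (setup, jobs, env) for one run; nothing is executed here.

    setup: argv lists that create the cgroups (empty in NORMAL mode)
    jobs:  argv lists for the two stress processes and lu.C.x
    env:   extra environment for lu.C.x (OpenMP thread count)
    """
    stress = ["stress", "--cpu", "1"]
    lu = ["./lu.C.x"]  # assumed location of the NAS binary
    if mode == "NORMAL":
        # Everything stays in the caller's control group.
        setup = []
        jobs = [list(stress), list(stress), lu]
    elif mode == "GROUP":
        # Assumption: cgroups are created with cgcreate and entered with cgexec.
        setup = [["cgcreate", "-g", group] for group in GROUPS.values()]
        jobs = [
            ["cgexec", "-g", GROUPS["stress_1"]] + stress,
            ["cgexec", "-g", GROUPS["stress_2"]] + stress,
            ["cgexec", "-g", GROUPS["lu"]] + lu,
        ]
    else:
        raise ValueError("mode must be NORMAL or GROUP")
    env = {"OMP_NUM_THREADS": str(threads)}
    return setup, jobs, env
```

Each returned argv list can be passed to subprocess.Popen, with env
merged into os.environ for the lu.C.x job.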

I run lu.C.x with different numbers of threads - for example, on a
4-node NUMA server with 4x Xeon Gold 6126 CPUs (96 CPUs in total), I run
lu.C.x with 72, 80, 88, and 92 threads. Since the server has 96 CPUs in
total, even with 92 threads for lu.C.x and two stress processes the
server is still not fully loaded.

Here are the runtimes in seconds for lu.C.x for different numbers of threads:

#Threads  NORMAL  GROUP
72        21.27   30.01
80        15.32   164
88        17.91   367
92        19.22   432

As you can see, even with 72 threads lu.C.x is significantly slower
when executed in a dedicated cgroup. And it gets much worse with an
increasing number of threads (a slowdown by a factor of 10x and greater).
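
The slowdown factors can be computed directly from the runtimes in the
table. This is just the arithmetic, with the numbers copied from above;
it gives roughly 1.4x, 10.7x, 20.5x and 22.5x for 72, 80, 88 and 92
threads.

```python
# Runtimes in seconds from the table: threads -> (NORMAL, GROUP).
RUNTIMES = {
    72: (21.27, 30.01),
    80: (15.32, 164.0),
    88: (17.91, 367.0),
    92: (19.22, 432.0),
}


def slowdown(runtimes):
    """Return threads -> GROUP runtime divided by NORMAL runtime."""
    return {t: group / normal for t, (normal, group) in runtimes.items()}


factors = slowdown(RUNTIMES)
for threads, factor in sorted(factors.items()):
    print(f"{threads:>3} threads: GROUP is {factor:.1f}x slower than NORMAL")
```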

Some more details are below.

Please let me know if this sounds interesting and if you would like to
look into it. I can provide you with the reproducer plus some
supplementary Python scripts to further analyze the results.

Thanks a lot!
Jirka

Some more details on the case with 80 threads for lu.C.x and 2 stress
processes running on a 96-CPU server with 4 NUMA nodes.

Analyzing the ps output is very interesting (here for 5 consecutive runs of
the workload):
========================================================
Average number of threads scheduled for NUMA NODE 0 1 2 3
========================================================
lu.C.x_80_NORMAL_1.ps.numa.hist Average 21.25 21.00 19.75 18.00
lu.C.x_80_NORMAL_1.stress.ps.numa.hist Average 1.00 1.00
lu.C.x_80_NORMAL_2.ps.numa.hist Average 20.50 20.75 18.00 20.75
lu.C.x_80_NORMAL_2.stress.ps.numa.hist Average 1.00 0.75 0.25
lu.C.x_80_NORMAL_3.ps.numa.hist Average 21.75 22.00 18.75 17.50
lu.C.x_80_NORMAL_3.stress.ps.numa.hist Average 1.00 1.00
lu.C.x_80_NORMAL_4.ps.numa.hist Average 21.50 21.00 18.75 18.75
lu.C.x_80_NORMAL_4.stress.ps.numa.hist Average 1.00 1.00
lu.C.x_80_NORMAL_5.ps.numa.hist Average 18.00 23.33 19.33 19.33
lu.C.x_80_NORMAL_5.stress.ps.numa.hist Average 1.00 1.00


As you can see, in NORMAL mode lu.C.x is uniformly scheduled over NUMA nodes.
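
Per-node thread counts like the ones above can be derived from the CPU
each thread last ran on (the psr column of ps) and the node-to-CPU
mapping in /sys/devices/system/node/node<N>/cpulist. The following is
only a sketch of that idea, not the script used for the tables; the
function names are hypothetical, and the contiguous CPU numbering in the
usage example is an assumption about the machine's topology.

```python
from collections import Counter


def parse_cpulist(text):
    """Expand a cpulist string such as "0-3,8,10-11" into a set of CPU ids.

    Only plain ids and "lo-hi" ranges are handled.
    """
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def threads_per_node(psr_values, node_cpulists):
    """Count threads per NUMA node.

    psr_values:    iterable of CPU ids, one per thread (e.g. from ps -L -o psr)
    node_cpulists: dict node id -> cpulist string for that node
    Returns a list of counts ordered by node id.
    """
    cpu_to_node = {
        cpu: node
        for node, text in node_cpulists.items()
        for cpu in parse_cpulist(text)
    }
    counts = Counter(cpu_to_node[cpu] for cpu in psr_values)
    return [counts.get(node, 0) for node in sorted(node_cpulists)]


# Hypothetical topology: 4 nodes with 24 contiguous CPUs each.
NODES = {0: "0-23", 1: "24-47", 2: "48-71", 3: "72-95"}
print(threads_per_node([0, 1, 24, 95, 95], NODES))
```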

Compare the NORMAL distribution with GROUP (cgroups) mode:
========================================================
Average number of threads scheduled for NUMA NODE 0 1 2 3
========================================================
lu.C.x_80_GROUP_1.ps.numa.hist Average 13.05 13.54 27.65 25.76
lu.C.x_80_GROUP_1.stress.ps.numa.hist Average 1.00 1.00
lu.C.x_80_GROUP_2.ps.numa.hist Average 12.18 14.85 27.56 25.41
lu.C.x_80_GROUP_2.stress.ps.numa.hist Average 1.00 1.00
lu.C.x_80_GROUP_3.ps.numa.hist Average 15.32 13.23 26.52 24.94
lu.C.x_80_GROUP_3.stress.ps.numa.hist Average 1.00 1.00
lu.C.x_80_GROUP_4.ps.numa.hist Average 13.82 14.86 25.64 25.68
lu.C.x_80_GROUP_4.stress.ps.numa.hist Average 1.00 1.00
lu.C.x_80_GROUP_5.ps.numa.hist Average 15.12 13.03 25.12 26.73
lu.C.x_80_GROUP_5.stress.ps.numa.hist Average 1.00 1.00

In cgroup mode, the scheduler moves lu.C.x away from nodes #0
and #1, where the stress processes are running. It does so to such an
extent that NUMA nodes #2 and #3 are overcommitted - these NUMA nodes have
more NAS threads scheduled than CPUs available (there are 24 CPUs in
each NUMA node).

Here is the detailed report:
$ more lu.C.x_80_GROUP_1.ps.numa.hist
#Date NUMA 0 NUMA 1 NUMA 2 NUMA 3
2018-Oct-27_04h39m57s 6 7 37 30
2018-Oct-27_04h40m02s 16 15 23 26
2018-Oct-27_04h40m08s 13 12 27 28
2018-Oct-27_04h40m13s 9 15 29 27
2018-Oct-27_04h40m18s 16 13 27 24
2018-Oct-27_04h40m23s 16 14 25 25
2018-Oct-27_04h40m28s 16 15 24 25
2018-Oct-27_04h40m33s 10 11 34 25
2018-Oct-27_04h40m38s 16 13 25 26
2018-Oct-27_04h40m43s 10 10 32 28
2018-Oct-27_04h40m48s 12 16 26 26
2018-Oct-27_04h40m53s 13 11 30 26
2018-Oct-27_04h40m58s 13 14 28 25
2018-Oct-27_04h41m03s 11 15 28 26
2018-Oct-27_04h41m08s 13 15 28 24
2018-Oct-27_04h41m13s 14 17 25 24
2018-Oct-27_04h41m18s 14 17 24 25
2018-Oct-27_04h41m24s 13 12 28 27
2018-Oct-27_04h41m29s 11 12 30 27
2018-Oct-27_04h41m34s 13 15 26 26
2018-Oct-27_04h41m39s 13 15 27 25
2018-Oct-27_04h41m44s 13 15 26 26
2018-Oct-27_04h41m49s 12 7 36 25
2018-Oct-27_04h41m54s 14 13 27 26
2018-Oct-27_04h41m59s 16 13 25 26
2018-Oct-27_04h42m04s 15 14 26 25
2018-Oct-27_04h42m09s 16 12 26 26
2018-Oct-27_04h42m14s 12 15 27 26
2018-Oct-27_04h42m19s 13 15 26 26
2018-Oct-27_04h42m24s 14 15 26 25
2018-Oct-27_04h42m29s 14 15 26 25
2018-Oct-27_04h42m34s 8 11 36 25
2018-Oct-27_04h42m39s 13 14 28 25
2018-Oct-27_04h42m45s 13 16 26 25
2018-Oct-27_04h42m50s 13 16 27 24
2018-Oct-27_04h42m55s 13 16 26 25
2018-Oct-27_04h43m00s 16 10 26 28
Average 13.05 13.54 27.65 25.76

Please notice that nodes #2 and #3 have SIGNIFICANTLY (up to 37!!) more
threads scheduled than the number of available CPUs (24), while nodes
#0 and #1 have plenty of idle cores. I think that this is the best
illustration of the Group Imbalance bug - some cores are overcommitted
while other cores are idle, and this imbalance is not getting any
better over time.
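
To quantify the overcommit, a report like the one above can be scanned
for samples where a node has more threads than CPUs. This is a sketch
that assumes the whitespace-separated layout shown in this mail (a
"#Date" header, one timestamped row per sample, a final "Average" row);
the function names are hypothetical, and SAMPLE holds only the first
three rows of the report.

```python
CPUS_PER_NODE = 24  # each NUMA node on this server has 24 CPUs

SAMPLE = """\
#Date NUMA 0 NUMA 1 NUMA 2 NUMA 3
2018-Oct-27_04h39m57s 6 7 37 30
2018-Oct-27_04h40m02s 16 15 23 26
2018-Oct-27_04h40m08s 13 12 27 28
"""


def parse_hist(text):
    """Return one list of per-node thread counts per sample row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the "#Date" header and the "Average" footer.
        if not fields or fields[0].startswith("#") or fields[0] == "Average":
            continue
        rows.append([int(value) for value in fields[1:]])
    return rows


def overcommit_summary(rows, cpus_per_node=CPUS_PER_NODE):
    """Per node: average, maximum and number of overcommitted samples."""
    summary = []
    for node, counts in enumerate(zip(*rows)):
        summary.append({
            "node": node,
            "average": sum(counts) / len(counts),
            "max": max(counts),
            "overcommitted_samples": sum(c > cpus_per_node for c in counts),
        })
    return summary


for entry in overcommit_summary(parse_hist(SAMPLE)):
    print(entry)
```

Running it over the full report instead of SAMPLE gives the number of
samples in which each node had more than 24 lu.C.x threads.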


[1]
There are four bugs described in the paper. I have actively worked
on all of them over the past two years; this is the current status: the
Group Construction and Missing Scheduling Domains bugs are fixed. I was
not able to reproduce the Overload on Wakeup bug (despite great effort). I
have created a reproducer for the Group Imbalance bug using cgroups,
but it seems there is no easy fix.