Group Imbalance bug - performance drop of up to 10x
From: Jirka Hladky
Date: Mon Feb 06 2017 - 18:38:03 EST
Hello,
we observe that the group imbalance bug can cause a performance
degradation of up to 10x on a server with 4 NUMA nodes.
I have opened Bug 194231
https://bugzilla.kernel.org/show_bug.cgi?id=194231
for this issue.
The problem was first described in this paper
http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
in chapter 3.1. The scheduler does not balance load correctly on a
4 NUMA node server in the following scenario:
* there are three independent ssh connections
* the first two ssh connections each run a single-threaded CPU-intensive workload
* the last ssh session runs a multi-threaded application which
requires almost all cores in the system.
We have used
* stress --cpu 1 as the single-threaded CPU-intensive workload
http://people.seas.harvard.edu/~apw/stress/
and
* the lu.C.x benchmark from the NAS Parallel Benchmarks suite as the
multi-threaded application
https://www.nas.nasa.gov/publications/npb.html
Version-Release number of selected component (if applicable):
Reproduced on
kernel 4.10.0-0.rc6
How reproducible:
It requires a server with at least 2 NUMA nodes. The problem gets worse
on a server with 4 NUMA nodes.
Steps to Reproduce:
1. Start 3 ssh connections to the server
2. In the first two ssh connections, run stress --cpu 1
3. In the third ssh connection, run the lu.C.x benchmark with the number of
threads equal to the number of CPUs in the system minus 4.
4. Run either Intel's numatop
echo "N" | numatop -d log >/dev/null 2>&1 &
or mpstat -P ALL 5 and check the load distribution across the NUMA
nodes. The mpstat output can be processed by the mpstat2node.py utility to
aggregate the data across NUMA nodes
https://github.com/jhladka/MPSTAT2NODE/blob/master/mpstat2node.py
mpstat -P ALL 5 | mpstat2node.py --lscpu <(lscpu)
5. Compare the results against the same workload started from ONE ssh
session (all processes are in one group)
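Steps 2 and 3 can be sketched as a small Python helper that derives the lu.C.x thread count and lists the command for each ssh session. This is only an illustration under assumptions not stated in this report: it assumes an OpenMP build of lu.C.x that honors OMP_NUM_THREADS and sits in the current directory, and the function names are made up for the example.

```python
import os

def lu_thread_count(n_cpus=None):
    """Threads for lu.C.x: number of CPUs in the system minus 4 (step 3)."""
    if n_cpus is None:
        n_cpus = os.cpu_count()  # logical CPUs visible to Python
    if n_cpus is None or n_cpus <= 4:
        raise ValueError("need more than 4 CPUs for this scenario")
    return n_cpus - 4

def session_commands(n_cpus=None):
    """Return one shell command per ssh session (hypothetical helper)."""
    threads = lu_thread_count(n_cpus)
    return [
        "stress --cpu 1",  # ssh session 1
        "stress --cpu 1",  # ssh session 2
        # Assumption: OpenMP build of NPB, binary in the current directory.
        f"OMP_NUM_THREADS={threads} ./lu.C.x",  # ssh session 3
    ]

# Example for a 64-CPU system like the one in this report.
for i, cmd in enumerate(session_commands(64), start=1):
    print(f"ssh session {i}: {cmd}")
```

On the 64-CPU machine this gives 60 lu.C.x threads plus the two stress processes, which matches the 62 CPU-intensive threads mentioned in the results.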
Actual results:
Uneven load across NUMA nodes:
Average:    NODE    %usr   %idle
Average:     all   66.12   33.51
Average:       0   37.97   61.74
Average:       1   31.67   68.15
Average:       2   97.50    1.98
Average:       3   97.33    2.19
Please note that while the number of CPU-intensive threads is 62 on this
64-CPU system, NUMA nodes #0 and #1 are underutilized.
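The imbalance can be checked with a few lines of arithmetic on the per-node figures above. The sketch below assumes that all four nodes have the same number of CPUs; the 50% threshold used to flag a node is an arbitrary choice for illustration, not something the scheduler or mpstat2node.py defines.

```python
# Per-node %usr values copied from the table above.
node_usr = {0: 37.97, 1: 31.67, 2: 97.50, 3: 97.33}

busy_threads = 62  # CPU-intensive threads reported in the text
total_cpus = 64    # CPUs in the system

# If the load were balanced, each node would be about this busy.
expected_usr = 100.0 * busy_threads / total_cpus

# Mean over nodes; assumes every node has the same number of CPUs.
observed_usr = sum(node_usr.values()) / len(node_usr)

# Hypothetical threshold: flag nodes below half of the balanced utilization.
underutilized = [n for n, usr in node_usr.items() if usr < 0.5 * expected_usr]

print(f"expected per-node %usr: {expected_usr:.2f}")
print(f"observed mean %usr:     {observed_usr:.2f}")
print(f"underutilized nodes:    {underutilized}")
```

With balanced load every node should sit near 97% user time, which nodes 2 and 3 reach, while nodes 0 and 1 stay below 40%.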
The real runtime of the lu.C.x benchmark went up from 114 seconds
to 846 seconds!
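For reference, the slowdown factor implied by these two runtimes can be computed as follows. The helper name is made up for the example.

```python
def slowdown(baseline_s, degraded_s):
    """Ratio of degraded runtime to baseline runtime (hypothetical helper)."""
    if baseline_s <= 0:
        raise ValueError("baseline runtime must be positive")
    return degraded_s / baseline_s

# Runtimes of lu.C.x in seconds, as reported in the text.
factor = slowdown(114.0, 846.0)
print(f"slowdown: {factor:.2f}x")
```

For this particular run the factor is about 7.4x.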
Expected results:
Load evenly balanced across all NUMA nodes. The real runtime of the lu.C.x
benchmark should be the same regardless of whether the jobs were started
from one ssh session or from multiple ssh sessions.
Additional info:
See
https://github.com/jplozi/wastedcores/blob/master/patches/group_imbalance_linux_4.1.patch
for a proposed patch for kernel 4.1.
I will upload a reproducer to the bug report
https://bugzilla.kernel.org/show_bug.cgi?id=194231
Thanks a lot!
Jirka