Group Imbalance bug - performance drop of up to 10x

From: Jirka Hladky
Date: Mon Feb 06 2017 - 18:38:03 EST


We observe that the group imbalance bug can degrade performance by up
to a factor of 10 on a 4-NUMA-node server.

I have opened Bug 194231 for this issue.

The problem was first described in this paper, in chapter 3.1. The
scheduler does not correctly balance load on a 4-NUMA-node server in
the following scenario:
* there are three independent ssh connections
* the first two ssh connections each run a single-threaded
CPU-intensive workload
* the last ssh session runs a multi-threaded application that
requires almost all cores in the system.

We have used:
* stress --cpu 1 as the single-threaded CPU-intensive workload
* the lu.C.x benchmark from the NAS Parallel Benchmarks suite as the
multi-threaded application

Version-Release number of selected component (if applicable):

Reproduced on kernel 4.10.0-0.rc6

How reproducible:

It requires a server with at least 2 NUMA nodes. The problem gets worse
on a 4-NUMA-node server.

Steps to Reproduce:
1. Start 3 ssh connections to the server.
2. In the first two ssh connections, run stress --cpu 1.
3. In the third ssh connection, run the lu.C.x benchmark with the number
of threads equal to the number of CPUs in the system minus 4.
4. Run either Intel's numatop
echo "N" | numatop -d log >/dev/null 2>&1 &
or mpstat -P ALL 5 and check the load distribution across the NUMA
nodes. The mpstat output can be processed by a utility to
aggregate data across NUMA nodes:

mpstat -P ALL 5 | --lscpu <(lscpu)

5. Compare the results against the same workload started from ONE ssh
session (all processes are in one group)
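
The per-node aggregation mentioned in step 4 can be sketched roughly as below. This is not the utility referred to above, only a minimal Python illustration. It assumes the CPU-to-node map comes from `lscpu -p=CPU,NODE` and that the input contains the "Average:" lines printed by mpstat. The function names `parse_cpu_to_node` and `aggregate_by_node` are hypothetical.

```python
from collections import defaultdict


def parse_cpu_to_node(lscpu_text):
    """Parse the output of `lscpu -p=CPU,NODE` into a {cpu: node} dict."""
    cpu_to_node = {}
    for line in lscpu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip the comment header lines
        cpu, node = line.split(",")[:2]
        cpu_to_node[int(cpu)] = int(node)
    return cpu_to_node


def aggregate_by_node(mpstat_text, cpu_to_node, columns=("%usr", "%idle")):
    """Average the chosen mpstat columns over the CPUs of each NUMA node.

    Only "Average:" lines are used; the "all" row is skipped.
    """
    header = None
    sums = defaultdict(lambda: [0.0] * len(columns))
    counts = defaultdict(int)
    for line in mpstat_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] != "Average:":
            continue
        if fields[1] == "CPU":
            header = fields  # remember the column positions
            continue
        if header is None or not fields[1].isdigit():
            continue  # skip the "all" row
        node = cpu_to_node[int(fields[1])]
        for i, col in enumerate(columns):
            sums[node][i] += float(fields[header.index(col)])
        counts[node] += 1
    return {n: [s / counts[n] for s in sums[n]] for n in sorted(sums)}
```

Column positions are looked up by header name because the set of mpstat columns differs between sysstat versions.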

Actual results:

Uneven load across NUMA nodes:
Average:  NODE   %usr  %idle
Average:   all  66.12  33.51
Average:     0  37.97  61.74
Average:     1  31.67  68.15
Average:     2  97.50   1.98
Average:     3  97.33   2.19

Please note that although 62 CPU-intensive threads are running on this
64-CPU system, NUMA nodes #0 and #1 are underutilized.

The real runtime of the lu.C.x benchmark went up from 114 seconds
to 846 seconds!
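
As a quick sanity check of the figures above, the following Python sketch recomputes the expected per-node utilization and the slowdown of this particular run. It assumes that each CPU-intensive thread can keep one CPU fully busy and that the four nodes have the same number of CPUs. All variable names are illustrative.

```python
# Values copied from the report above.
total_cpus = 64
lu_threads = total_cpus - 4      # step 3: number of CPUs minus 4
stress_threads = 2               # step 2: one stress process per ssh session
busy_threads = lu_threads + stress_threads

observed_usr = {0: 37.97, 1: 31.67, 2: 97.50, 3: 97.33}

# With an even spread, every node would show roughly this %usr.
expected_usr = 100.0 * busy_threads / total_cpus
mean_observed = sum(observed_usr.values()) / len(observed_usr)

# Slowdown of lu.C.x between the one-session and three-session runs.
slowdown = 846 / 114

print(f"busy threads:        {busy_threads}")
print(f"expected %usr/node:  {expected_usr:.2f}")
print(f"mean observed %usr:  {mean_observed:.2f}")
print(f"lu.C.x slowdown:     {slowdown:.2f}x")
```

The mean of the four per-node values agrees with the "all" row of the table, which is consistent with equally sized nodes.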

Expected results:

Load is evenly balanced across all NUMA nodes. The real runtime of the
lu.C.x benchmark is the same regardless of whether the jobs were started
from one ssh session or from multiple ssh sessions.

Additional info:

as proposal for the patch for kernel 4.1.

I will upload a reproducer to the bug report.

Thanks a lot!