[PATCH 0/7][RESEND] Fix cpu imbalance with equal weighted groups

From: Josef Bacik
Date: Fri Jul 14 2017 - 09:22:39 EST


(sorry to anybody who got this already, fb email is acting weird and ate the
linux-kernel submission so I have no idea who got this and who didn't.)

Hello,

While testing stacked services we noticed that if you started a normal
CPU-heavy application in one cgroup, and a CPU stress test in another cgroup of
equal weight, the stress group would get significantly more CPU time, usually
around 50% more.

Peter fixed some load propagation issues for Tejun a few months ago. Those
changes fixed the latency issues Tejun was seeing, but they made this imbalance
worse, so the CPU stress group was now getting more like 70% more CPU time.

The following patches first fix the regression introduced by Peter's patches
and then fix the imbalance itself. Part of the imbalance fix builds on Peter's
propagation patches; we just needed the runnable weight to be calculated
differently to avoid the regression.

Essentially what happens is that the "stress" group has tasks that never leave
the CPU, so its load average and runnable load average skew towards the tasks'
load.weight. The interactive tasks, on the other hand, go on and off the CPU,
resulting in a lower load average. Peter's changes to rely more on the runnable
load average exacerbated this problem.
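
For intuition, here is a rough illustration (not the kernel's actual PELT decay
math, and approx_load_avg() is a made-up helper) of why the always-running
tasks end up dominating:

/*
 * Rough illustration only: a task's tracked load converges towards
 * (weight * fraction of time it is runnable).
 */
static unsigned long approx_load_avg(unsigned long weight,
				     unsigned int runnable_pct)
{
	return weight * runnable_pct / 100;
}

/*
 * stress task, runnable ~100% of the time: approx_load_avg(1024, 100) == 1024
 * interactive task, runnable ~40%:         approx_load_avg(1024, 40)  == 409
 */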

To solve this problem I've done a few things. First, we use the max of the
weight and the load average for our cfs_rq weight calculations. This allows
tasks that have a lower load average but a higher weight to have an appropriate
effect on the cfs_rq when they are enqueued.
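
A minimal sketch of that idea (the helper name is mine, and this ignores the
weight scaling the real code has to handle):

static inline unsigned long se_enqueue_weight(struct sched_entity *se)
{
	/*
	 * Take whichever is larger: the entity's static weight or its
	 * tracked load average.  A high-weight task whose load_avg has
	 * decayed while it was off the CPU still contributes its full
	 * weight to the cfs_rq it is enqueued onto.
	 */
	return max(se->load.weight, se->avg.load_avg);
}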

The second part of the fix is changing how we decide to do wake affinity. With
the other patches applied, simply disabling wake affinity also made the
imbalance disappear. Fixing wake affinity properly involves a few things.

First we need to change effective_load() to re-calculate the historic weight in
addition to the new weight with the new process added. Simply using our old
weight/load_avg would be inaccurate if the task_group's load_avg had changed at
all since we calculated our load. In practice this meant that effective_load()
would often (almost always for my testcase) return a negative delta for adding
the process to the given CPU, so we always did wake affine even though the load
on the current CPU was too high.
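
The shape of the change is roughly the following (heavily simplified and the
names are illustrative; the real effective_load() walks the whole task_group
hierarchy):

static long effective_delta(long tg_shares, long tg_load_avg,
			    long cfs_rq_load, long task_load)
{
	long old_weight, new_weight;

	/*
	 * Compute the share of tg_shares this cfs_rq gets both without and
	 * with the waking task, using the *current* task_group load_avg for
	 * both, so that the two weights are actually comparable.
	 */
	old_weight = tg_shares * cfs_rq_load / (tg_load_avg + 1);
	new_weight = tg_shares * (cfs_rq_load + task_load) /
		     (tg_load_avg + task_load + 1);

	return new_weight - old_weight;
}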

Those patches get us 95% of the way there. The final patch is probably the most
controversial one, but it brings us to complete balance between the two groups.
One thing we observed was that we would wake affine and then promptly load
balance tasks off of the CPU that we woke to, so tasks bounced around CPUs
constantly. To avoid this thrashing, record the last time we were load balanced
and wait HZ duration before allowing an affinity wakeup to occur. This reduced
the thrashing quite a bit and brought our CPU usage to equality.
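
The check is roughly this (rq->last_lb_jiffies is a made-up field standing in
for wherever the timestamp ends up living):

static bool wake_affine_allowed(struct rq *rq)
{
	/*
	 * The load balancer stamps last_lb_jiffies whenever it moves tasks
	 * off this runqueue; refuse affine wakeups onto it until HZ jiffies
	 * (one second) have passed since then.
	 */
	return time_after(jiffies, rq->last_lb_jiffies + HZ);
}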

I have a stripped-down reproducer here

https://github.com/josefbacik/debug-scripts/tree/master/unbalanced-reproducer

unbalanced.sh uses the cgroup2 interface, which requires Tejun's cgroup2 cpu
controller patch, and unbalanced-v1.sh uses the old cgroup v1 interface and
assumes you have cpuacct,cpu mounted at /sys/fs/cgroup/cpuacct. You also need
rt-app installed.

Thanks,

Josef