group scheduler regression since 4.3 (bisect 9d89c257d sched/fair: Rewrite runnable load and utilization average tracking)

From: Christian Borntraeger
Date: Mon Sep 26 2016 - 06:42:40 EST


Folks,

I have seen big scalability degradations since 4.3 (bisected to 9d89c257d
"sched/fair: Rewrite runnable load and utilization average tracking").
This has not been fixed by subsequent patches, e.g. the ones that try to
fix this for interactive workloads.

The problem is only visible for sleep/wakeup-heavy workloads that are
part of a scheduler group (e.g. sysbench OLTP inside a KVM guest, since
libvirt puts KVM guests into cgroup instances).

For example, a simple sysbench OLTP run with mysql inside a KVM guest with
16 CPUs, backed by 8 host CPUs (16 host threads), scales poorly (scaling
up inside one guest by running multiple instances). The numbers below
are events per second.
Unmounting /sys/fs/cgroup/cpu,cpuacct (thus forcing libvirt to not
use group scheduling for KVM guests) makes the behaviour much better:


instances    group    nogroup
    1         3406       3002
    2         5078       4940
    3         6017       6760
    4         6471       8216  (+27%)
    5         6716       9196
    6         6976       9783
    7         7127      10170
    8         7399      10385  (+40%)
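
For reference, the percentages in parentheses are the gain of the nogroup
column over the group column. A minimal sketch of that arithmetic for all
rows (the names RESULTS and relative_gain are only illustrative, and the
figures are copied from the table above):

```python
# Events per second from the table above: (instances, group, nogroup).
RESULTS = [
    (1, 3406, 3002),
    (2, 5078, 4940),
    (3, 6017, 6760),
    (4, 6471, 8216),
    (5, 6716, 9196),
    (6, 6976, 9783),
    (7, 7127, 10170),
    (8, 7399, 10385),
]


def relative_gain(group, nogroup):
    """Percent change of nogroup relative to group, rounded to an integer."""
    return round((nogroup - group) / group * 100)


if __name__ == "__main__":
    for instances, group, nogroup in RESULTS:
        # The "+d" format keeps the sign, so rows where group wins show as negative.
        print(f"{instances:>2}  {relative_gain(group, nogroup):+d}%")
```

With one or two instances the group numbers are slightly ahead; the gap
in favour of nogroup opens up from three instances onwards.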

Before 9d89c257d ("sched/fair: Rewrite runnable load and utilization
average tracking") there was basically no difference between group
and non-group scheduling. These numbers are with 4.7; older kernels after
9d89c257d show a similar difference.

The bad thing is that there is a lot of idle CPU power in the host
when this happens, so the scheduler seems not to realize that this
workload could use more CPUs in the host.

I tried some experiments, but I have not found a hack that "fixes" the
degradation, which would give me an indication of which part of the code
is broken. So are there any ideas? Is the estimated group load
calculation just not fast enough for sleep/wakeup workloads?

Christian