Re: group scheduler regression since 4.3 (bisect 9d89c257d sched/fair: Rewrite runnable load and utilization average tracking)

From: Christian Borntraeger
Date: Mon Sep 26 2016 - 07:42:29 EST

Next message: Baoyou Xie: "[PATCH] vhost: mark symbols static in vhost.c"
Previous message: Rafael J. Wysocki: "Re: Regression in 4.8 - CPU speed set very low"
In reply to: Peter Zijlstra: "Re: group scheduler regression since 4.3 (bisect 9d89c257d sched/fair: Rewrite runnable load and utilization average tracking)"
Next in thread: Peter Zijlstra: "Re: group scheduler regression since 4.3 (bisect 9d89c257d sched/fair: Rewrite runnable load and utilization average tracking)"

On 09/26/2016 12:56 PM, Peter Zijlstra wrote:
> On Mon, Sep 26, 2016 at 12:42:22PM +0200, Christian Borntraeger wrote:
>> Folks,
>>
>> I have seen big scalability degradations since 4.3 (bisected 9d89c257d
>> sched/fair: Rewrite runnable load and utilization average tracking).
>> This has not been fixed by subsequent patches, e.g. the ones that try to
>> fix this for interactive workloads.
>>
>> The problem is only visible for sleep/wakeup heavy workloads, and only
>> when they are part of a scheduler group (e.g. a sysbench OLTP inside a
>> KVM guest, as libvirt will put KVM guests into cgroup instances).
>>
>> For example, a simple sysbench OLTP run with mysql inside a KVM guest with
>> 16 CPUs backed by 8 host CPUs (16 host threads) scales worse when scaling
>> up inside the guest with multiple instances. These are the numbers of
>> events per second.
>> Unmounting /sys/fs/cgroup/cpu,cpuacct (thus forcing libvirt to not
>> use group scheduling for KVM guests) makes the behaviour much better:
>>
>>
>> instances  group  nogroup
>>         1   3406     3002
>>         2   5078     4940
>>         3   6017     6760
>>         4   6471     8216  (+27%)
>>         5   6716     9196
>>         6   6976     9783
>>         7   7127    10170
>>         8   7399    10385  (+40%)
>>
>> Before 9d89c257d ("sched/fair: Rewrite runnable load and utilization
>> average tracking") there was basically no difference between group
>> and non-group scheduling. These numbers are with 4.7; older kernels after
>> 9d89c257d show a similar difference.
>>
>> The bad thing is that there is a lot of idle cpu power in the host
>> when this happens so the scheduler seems to not realize that this
>> workload could use more cpus in the host.
>>
>> I tried some experiments, but I have not found a hack that "fixes" the
>> degradation, which would give me an indication which part of the code
>> is broken. So are there any ideas? Is the estimated group load
>> calculation just not fast enough for sleep/wakeup workloads?
>
> One of the differences in the old and new thing is being addressed by
> these patches:
>
> https://lkml.kernel.org/r/1473666472-13749-1-git-send-email-vincent.guittot@xxxxxxxxxx
>
> Could you see if those patches make a difference? If not, we'll have to
> go poke elsewhere of course ;-)

Those patches do not apply cleanly on v4.7, linux/master or next/master.
Is there a good branch to test these patches?
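
For reference, the group vs. nogroup figures in the quoted table can be recomputed with a short script. This is only an illustrative sketch: the numbers are copied from the table above, while the names RESULTS and relative_gain are made up for this example and are not part of sysbench or the kernel.

```python
# Events per second from the quoted table: (instances, group, nogroup).
RESULTS = [
    (1, 3406, 3002),
    (2, 5078, 4940),
    (3, 6017, 6760),
    (4, 6471, 8216),
    (5, 6716, 9196),
    (6, 6976, 9783),
    (7, 7127, 10170),
    (8, 7399, 10385),
]


def relative_gain(group, nogroup):
    """Return how much faster nogroup is than group, in percent."""
    return (nogroup - group) / group * 100.0


if __name__ == "__main__":
    print(f"{'instances':>9} {'group':>6} {'nogroup':>8} {'delta':>8}")
    for instances, group, nogroup in RESULTS:
        delta = relative_gain(group, nogroup)
        print(f"{instances:>9} {group:>6} {nogroup:>8} {delta:>+7.1f}%")
```

With one or two instances group scheduling is still slightly ahead; the gap in favour of nogroup opens from three instances onwards and keeps growing up to eight.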
