Re: [PATCH v3 09/12] Revert "sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED"

From: Morten Rasmussen
Date: Fri Jul 11 2014 - 12:13:56 EST


On Fri, Jul 11, 2014 at 08:51:06AM +0100, Vincent Guittot wrote:
> On 10 July 2014 15:16, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
> >> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
> >>
> >> We are going to use runnable_avg_sum and runnable_avg_period in order to get
> >> the utilization of the CPU. This statistic includes all tasks that run on
> >> the CPU and not only CFS tasks.
> >
> > But this rq->avg is not the one that is migration aware, right? So why
> > use this?
>
> Yes, it's not the one that is migration aware
>
> >
> > We already compensate cpu_capacity for !fair tasks, so I don't see why
> > we can't use the migration aware one (and kill this one as Yuyang keeps
> > proposing) and compensate with the capacity factor.
>
> The first point is that cpu_capacity is compensated by both !fair_tasks
> and frequency scaling, and we should not take frequency scaling into
> account when detecting overload
>
> What we have now is the weighted load avg, which is the sum of the
> weighted load of the entities on the run queue. This is not usable to
> detect overload because of the weight. An unweighted version of this
> figure would be more useful, but it's not as accurate as the one I use,
> IMHO.

IMHO there is no perfect utilization metric, but I think it is
fundamentally wrong to use a metric that is migration unaware to make
migration decisions. I mentioned that during the last review as well. It
is like having a very fast controller with a really slow (large delay)
feedback loop. There is a high risk of getting an unstable balance when
your load-balance rate is faster than the feedback delay.

> The example that was discussed during the review of the last version
> showed some limitations
>
> With the following schedule pattern from Morten's example
>
> | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
> A: run    rq    run  ------------- sleeping ------------  run
> B: rq     run   rq     run --------- sleeping ---------   rq
>
> The scheduler will see the following values:
> Task A's unweighted load value is 47%
> Task B's unweighted load is 60%
> The maximum sum of unweighted load is 104%
> rq->avg load is 60%
>
> And the real CPU load is 50%
>
> So we will reach opposite decisions depending on which value is used:
> rq->avg or the sum of unweighted load
>
> The sum of unweighted load has the main advantage of showing
> immediately what the relative impact of adding/removing a task will
> be. In the example, we can see that removing task A or B will remove
> around half the CPU load, but it's not so good at giving the current
> utilization of the CPU

You forgot to mention the issues with rq->avg that were brought up last
time :-)

Here is a load-balancing example:

Tasks A, B, C, and D are all running/runnable constantly. To avoid
decimals we assume the sched tick has a 9 ms period. We have four
cpus in a single sched_domain.

rq == rq->avg
uw == unweighted tracked load
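
To be explicit about what the two rows mean, here is a toy userspace
model of the decay behaviour I'm assuming in the tables below (not the
actual kernel code): rq->avg is a property of the cpu and only changes
as time passes on that cpu, whereas uw is the sum of the per-entity
ratios of the tasks currently queued, so it changes the moment a task
is enqueued or dequeued.

/* per-ms decay factor; contributions are halved every 32 ms */
const double decay = 0.978572;	/* ~0.5^(1/32) */

struct toy_avg {
	double sum;	/* decayed time spent busy/runnable */
	double period;	/* decayed total elapsed time */
};

/* advance a signal by 1 ms; active is 1 if that ms counted as busy */
void toy_update(struct toy_avg *a, int active)
{
	a->sum = a->sum * decay + active;
	a->period = a->period * decay + 1.0;
}

/* utilization ratio in [0, 1] */
double toy_ratio(const struct toy_avg *a)
{
	return a->period > 0.0 ? a->sum / a->period : 0.0;
}

rq below is toy_ratio() of a per-cpu signal updated with active = 1
whenever the cpu has runnable tasks; uw is the sum of toy_ratio() over
the entities currently on the runqueue, which is why uw drops from 400%
to 100% on cpu0 the moment three of the tasks are pulled away.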

cpu0:
      | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
A:      run    rq     rq
B:      rq     run    rq
C:      rq     rq     run
D:      rq     rq     rq     run    run    run    run    run    run
rq:    100%   100%   100%   100%   100%   100%   100%   100%   100%
uw:    400%   400%   400%   100%   100%   100%   100%   100%   100%

cpu1:
      | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
A:                           run    rq     run    rq     run    rq
B:                           rq     run    rq     run    rq     run
C:
D:
rq:      0%     0%     0%     0%     6%    12%    18%    23%    28%
uw:      0%     0%     0%   200%   200%   200%   200%   200%   200%

cpu2:
      | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
A:
B:
C:                           run    run    run    run    run    run
D:
rq:      0%     0%     0%     0%     6%    12%    18%    23%    28%
uw:      0%     0%     0%   100%   100%   100%   100%   100%   100%

cpu3:
      | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
A:
B:
C:
D:
rq:      0%     0%     0%     0%     0%     0%     0%     0%     0%
uw:      0%     0%     0%     0%     0%     0%     0%     0%     0%

A periodic load-balance occurs on cpu1 after 9 ms. cpu0 rq->avg
indicates overload. Consequently cpu1 pulls tasks A and B.

Shortly after (<1 ms) cpu2 does a periodic load-balance. cpu0 rq->avg
hasn't changed so cpu0 still appears overloaded. cpu2 pulls task C.

Shortly after (<1 ms) cpu3 does a periodic load-balance. cpu0 rq->avg
still indicates overload so cpu3 tries to pull tasks but fails since
there is only task D left.

9 ms later the sched tick causes periodic load-balances on all the cpus.
cpu0 rq->avg still indicates that it has the highest load since cpu1
rq->avg has not had time to indicate overload. Consequently cpu1, 2,
and 3 will try to pull from cpu0 and fail. The balance will only change
once cpu1 rq->avg has increased enough to indicate overload.
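
For reference, the ramp-up in the cpu1/cpu2 rq rows above follows from
the geometric decay used by the load tracking, where contributions are
roughly halved every 32 ms. A quick userspace sketch of that arithmetic
(not kernel code), assuming the runqueue average had decayed to ~0%
before the tasks arrived:

#include <stdio.h>
#include <math.h>

int main(void)
{
	/*
	 * Per-ms decay factor y with y^32 = 0.5.  A runqueue that was
	 * idle (avg ~0%) and then becomes fully busy ramps up as
	 * 1 - y^t after t ms.
	 */
	double y = pow(0.5, 1.0 / 32.0);

	for (int t = 3; t <= 15; t += 3)
		printf("busy for %2d ms -> rq->avg ~ %2.0f%%\n",
		       t, (1.0 - pow(y, t)) * 100.0);

	return 0;
}

That gives roughly 6%, 12%, 18%, 23% and 28% after 3, 6, 9, 12 and
15 ms of being fully busy, which is the feedback delay I'm worried
about: it takes another full balance interval or more before cpu1
looks loaded enough for anyone to react.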

Unweighted load, on the other hand, reflects load changes
instantaneously, so cpu3 would observe the overload of cpu1 immediately
and pull task A or B.

In this example using rq->avg leads to imbalance whereas unweighted load
would not. Correct me if I missed anything.

Coming back to the previous example, I'm not convinced that inflation of
the unweighted load sum when tasks overlap in time is a bad thing. I
have mentioned this before. The average cpu utilization over the 40 ms
period is 50%. However, the true compute capacity demand is 200% for the
first 15 ms of the period, 100% for the next 5 ms and 0% for the
remaining 20 ms. The cpu is actually overloaded for 15 ms every 40 ms.
This fact is factored into the unweighted load, whereas rq->avg would
give you the same utilization whether the tasks overlap or not. Hence
unweighted load would give us an indication that the mix of tasks isn't
optimal even if the cpu has spare cycles.
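
To make the arithmetic explicit, here is a plain userspace sketch of
those numbers, slicing one 40 ms cycle of the quoted A/B pattern into
5 ms slots:

#include <stdio.h>

int main(void)
{
	/*
	 * Tasks wanting to run in each 5 ms slot of one 40 ms cycle:
	 * A and B for the first 15 ms, B alone for the next 5 ms,
	 * nothing for the remaining 20 ms.
	 */
	int demand[8] = { 2, 2, 2, 1, 0, 0, 0, 0 };
	int busy_slots = 0, overloaded_slots = 0;

	for (int i = 0; i < 8; i++) {
		if (demand[i] > 0)
			busy_slots++;
		if (demand[i] > 1)
			overloaded_slots++;
	}

	/* prints 50% average utilization, overloaded 15 ms out of 40 ms */
	printf("average utilization: %d%%\n", busy_slots * 100 / 8);
	printf("overloaded for %d ms out of %d ms\n",
	       overloaded_slots * 5, 8 * 5);
	return 0;
}

The 50% average hides the fact that for 15 of those 40 ms the cpu cannot
serve both tasks, which is exactly what the inflated unweighted sum
exposes.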

If you don't care about overlap and latency, the unweighted sum of task
running time (that Peter has proposed a number of times) is a better
metric, IMHO, as long as the cpu isn't fully utilized.

Morten
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/