Re: [PATCH v2 08/11] sched: get CPU's activity statistic

From: Morten Rasmussen
Date: Wed May 28 2014 - 11:47:18 EST


On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote:
> On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@xxxxxxx> wrote:
> > On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
> >> Monitor the activity level of each group of each sched_domain level. The
> >> activity is the amount of cpu_power that is currently used on a CPU or group
> >> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
> >> level. In the special use case where the CPU is fully loaded by more than 1
> >> task, the activity level is set above the cpu_power in order to reflect the
> >> overload of the CPU.
> >>
> >> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> >> ---
> >> kernel/sched/fair.c | 22 ++++++++++++++++++++++
> >> 1 file changed, 22 insertions(+)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index b7c51be..c01d8b6 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu)
> >> return cpu_rq(cpu)->cpu_power;
> >> }
> >>
> >> +static unsigned long power_orig_of(int cpu)
> >> +{
> >> + return cpu_rq(cpu)->cpu_power_orig;
> >> +}
> >> +
> >> static unsigned long cpu_avg_load_per_task(int cpu)
> >> {
> >> struct rq *rq = cpu_rq(cpu);
> >> @@ -4438,6 +4443,18 @@ done:
> >> return target;
> >> }
> >>
> >> +static int get_cpu_activity(int cpu)
> >> +{
> >> + struct rq *rq = cpu_rq(cpu);
> >> + u32 sum = rq->avg.runnable_avg_sum;
> >> + u32 period = rq->avg.runnable_avg_period;
> >> +
> >> + if (sum >= period)
> >> + return power_orig_of(cpu) + rq->nr_running - 1;
> >> +
> >> + return (sum * power_orig_of(cpu)) / period;
> >> +}
> >
> > The rq runnable_avg_{sum, period} give a very long term view of the cpu
> > utilization (I will use the term utilization instead of activity as I
> > think that is what we are talking about here). IMHO, it is too slow to
> > be used as a basis for load-balancing decisions. I think that was also
> > agreed upon in the last discussion related to this topic [1].
> >
> > The basic problem is the worst case: with sum starting from 0 and
> > period already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345
> > periods (ms) for sum to reach 47742. In other words, the cpu might have
> > been fully utilized for 345 ms before it is considered fully utilized.
> > Periodic load-balancing happens much more frequently than that.
>
> I agree that it's not really responsive, but several scheduler
> statistics use the same kind of metric and have the same kind of
> responsiveness.

I might be wrong, but I don't think we use anything similar to this to
estimate cpu load/utilization for load-balancing purposes, except for
{source, target}_load(), where it is used to bias the decision of
whether or not to balance when the difference is small. That is what the
discussion was about last time.

> I agree that it's not enough, and that's why I'm not using only this
> metric, but it gives information that the unweighted load_avg_contrib
> (which you mention below) can't give. So I would be less categorical
> than you and say that we probably need additional metrics.

I'm not saying that we shouldn't use this metric at all, I'm just saying
that I don't think it is suitable for estimating the short-term cpu
utilization, which is what you need to make load-balancing decisions. We
can't observe the effect of recent load-balancing decisions if the
metric is too slow to react.

I realize that what I mean by 'slow' might be unclear. Load tracking
(both task and rq) takes a certain amount of history into account in
runnable_avg_{sum, period}. This amount is determined by the 'y'-weight,
which has been chosen such that we consider the load in the past 345
time units, where the time unit is ~1 ms. The contribution is smaller
the further you go back, due to the y^n weighting, which effectively
diminishes to 0 for n > 345. So, if a task or cpu goes from having been
idle for >345 ms to being constantly busy, it will take 345 ms until the
entire history that we care about reflects this change. Only then will
runnable_avg_sum reach 47742. The rate of change is faster to begin
with, since the weight of the most recent history is higher:
runnable_avg_sum gets to 47742/2 in just 32 ms.
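
To make the timescale concrete, here is a small stand-alone sketch
(plain user-space C, not kernel code) of the decay series under the
y^32 = 0.5 assumption. The floating-point limit comes out near 47788;
the kernel's fixed-point arithmetic rounds LOAD_AVG_MAX down to 47742,
but the shape of the curve is the same:

/*
 * Sketch of the runnable_avg decay arithmetic: 1024 units are added
 * per 1 ms period and the history decays by y per period, y^32 = 0.5.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
        const double y = pow(0.5, 1.0 / 32.0); /* per-period decay */
        const double max = 1024.0 / (1.0 - y); /* series limit, ~47788 */
        const int ms[] = { 1, 10, 32, 345 };
        unsigned int i;

        for (i = 0; i < sizeof(ms) / sizeof(ms[0]); i++) {
                /* sum after ms[i] periods of 100% busy, starting from 0 */
                double sum = max * (1.0 - pow(y, ms[i]));
                printf("%3d ms: sum = %5.0f (%3.0f%% of max)\n",
                       ms[i], sum, 100.0 * sum / max);
        }
        return 0;
}

This prints ~2% after 1 ms, ~19% after 10 ms, 50% after 32 ms and ~100%
after 345 ms, which is where the numbers above come from.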

Since we may do periodic load-balance every 10 ms or so, we will perform
a number of load-balances where runnable_avg_sum will mostly be
reflecting the state of the world before a change (new task queued or
a task moved to a different cpu). If you have two tasks running
continuously on one cpu while the other cpu is idle, and you move one of
the tasks to the other cpu, runnable_avg_sum will remain unchanged,
47742, on the first cpu while it starts from 0 on the other one. 10 ms
later it will have increased a bit, 32 ms later it will be 47742/2, and
345 ms later it reaches 47742. In the meantime the cpu doesn't appear
fully utilized and we might decide to put more tasks on it, because we
don't know whether runnable_avg_sum represents a partially utilized cpu
(for example a 50% task) or whether it will continue to rise and
eventually reach 47742.
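
Plugging those numbers into the ratio computed by get_cpu_activity() in
this patch (sum * power / period), and assuming cpu_power_orig = 1024
and a period that is already saturated at its maximum, the destination
cpu would report something like this after the migration:

/* Sketch, not kernel code: get_cpu_activity()'s view of a cpu that
 * went from idle to 100% busy, with cpu_power_orig = 1024 assumed. */
#include <math.h>
#include <stdio.h>

int main(void)
{
        const double y = pow(0.5, 1.0 / 32.0);
        const double period = 1024.0 / (1.0 - y); /* saturated */
        const int ms[] = { 10, 32, 345 };
        unsigned int i;

        for (i = 0; i < sizeof(ms) / sizeof(ms[0]); i++) {
                double sum = period * (1.0 - pow(y, ms[i]));
                /* sum * power_orig / period, as in the patch */
                printf("%3d ms after migration: activity = %4.0f / 1024\n",
                       ms[i], sum * 1024.0 / period);
        }
        return 0;
}

That is roughly 199/1024 after 10 ms and 512/1024 after 32 ms; only
after ~345 ms does the fully busy cpu report its full capacity.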

IMO, we need the cpu utilization metric to track the current
utilization of the cpu closely enough that any changes since the last
wakeup or load-balance are clearly visible.

>
> >
> > Also, if load-balancing actually moves tasks around it may take quite a
> > while before runnable_avg_sum actually reflects this change. The next
> > periodic load-balance is likely to happen before runnable_avg_sum has
> > reflected the result of the previous periodic load-balance.
>
> runnable_avg_sum uses a 1 ms unit step, so I tend to disagree with
> your point above.

See the explanation above. The important thing is how much history we
take into account: 345 time units of 1 ms each. The rate at which the
sum is updated doesn't change anything. 1 ms after a change (wakeup,
load-balance, ...) runnable_avg_sum can only have changed by 1024. The
remaining ~98% of your weighted history still reflects the world before
the change.

> > To avoid these problems, we need to base utilization on a metric which
> > is updated instantaneously when we add/remove tasks to a cpu (or at
> > least fast enough that we don't see the above problems). In the
> > previous discussion [1] it was suggested to use a sum of unweighted
> > task runnable_avg_{sum,period} ratios instead. That is, an unweighted
> > equivalent to weighted_cpuload(). That isn't a perfect solution either.
>
> Regarding the unweighted load_avg_contrib, you will have a similar
> issue because of the slowness of the variation of each sched_entity's
> load that is added to/removed from the unweighted load_avg_contrib.
>
> The update of the runnable_avg_{sum,period} of a sched_entity is
> quite similar to cpu utilization.

Yes, runnable_avg_{sum, period} for tasks and rqs are exactly the same.
No difference there :)

> This value is linked to the CPU on
> which it has run previously because of the time sharing with other
> tasks, so the unweighted load of a freshly migrated task will reflect
> its load on the previous CPU (including the time sharing with other
> tasks on the prev CPU).

I agree that the task runnable_avg_sum is always affected by the
circumstances on the cpu where it is running, and that it takes this
history with it. However, I think cfs.runnable_load_avg leads to fewer
problems than using the rq runnable_avg_sum. It would work nicely for
the two-tasks-on-two-cpus example I mentioned earlier. We don't need to
add something on top when the cpu is fully utilized by more than one
task; that comes more naturally with cfs.runnable_load_avg. If it is
much larger than 47742, it should be fairly safe to assume that you
shouldn't stick more tasks on that cpu.
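
To illustrate why the per-task view reacts instantly on migration, here
is a toy model (plain C, not the actual cfs code; the numbers are made
up): each task carries its own runnable_avg_{sum, period} ratio, so an
unweighted per-cpu total changes the moment a task is enqueued or
dequeued:

#include <stdio.h>

struct task {
        unsigned int sum;       /* per-task runnable_avg_sum */
        unsigned int period;    /* per-task runnable_avg_period */
};

/* unweighted utilization of a queue of tasks, in 1/1024 units */
static unsigned int unweighted_load(const struct task **q, int n)
{
        unsigned int total = 0;
        int i;

        for (i = 0; i < n; i++)
                total += 1024 * q[i]->sum / q[i]->period;
        return total;
}

int main(void)
{
        /* two always-runnable tasks time sharing cpu0: each ran ~50% */
        struct task a = { 23871, 47742 }, b = { 23871, 47742 };
        const struct task *cpu0_before[] = { &a, &b };
        const struct task *cpu0_after[] = { &a };
        const struct task *cpu1_after[] = { &b };

        printf("before: cpu0 = %u, cpu1 = 0\n",
               unweighted_load(cpu0_before, 2));
        /* migrate b: its contribution follows it immediately... */
        printf("after:  cpu0 = %u, cpu1 = %u\n",
               unweighted_load(cpu0_after, 1),
               unweighted_load(cpu1_after, 1));
        /* ...but b still looks ~50% busy, reflecting the time sharing
         * on cpu0, until its own history catches up. */
        return 0;
}

The total moves with the task immediately, but, as you say, each task's
ratio still reflects the time sharing on the previous cpu for a while.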

>
> I'm not saying that such a metric is useless, but it's not perfect either.

It comes with its own set of problems, agreed. Based on my current
understanding (or lack thereof), they just seem smaller :)

Morten