Re: [PATCH -v2 02/18] sched/fair: Add comment to calc_cfs_shares()

From: Morten Rasmussen
Date: Thu Sep 28 2017 - 06:03:19 EST


On Fri, Sep 01, 2017 at 03:21:01PM +0200, Peter Zijlstra wrote:
> Explain the magic equation in calc_cfs_shares() a bit better.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 61 insertions(+)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2707,6 +2707,67 @@ account_entity_dequeue(struct cfs_rq *cf
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> # ifdef CONFIG_SMP
> +/*
> + * All this does is approximate the hierarchical proportion which includes that
> + * global sum we all love to hate.
> + *
> + * That is, the weight of a group entity, is the proportional share of the
> + * group weight based on the group runqueue weights. That is:
> + *
> + * tg->weight * grq->load.weight
> + * ge->load.weight = ----------------------------- (1)
> + * \Sum grq->load.weight
> + *
> + * Now, because computing that sum is prohibitively expensive to compute (been
> + * there, done that) we approximate it with this average stuff. The average
> + * moves slower and therefore the approximation is cheaper and more stable.
> + *
> + * So instead of the above, we substitute:
> + *
> + * grq->load.weight -> grq->avg.load_avg (2)
> + *
> + * which yields the following:
> + *
> + * tg->weight * grq->avg.load_avg
> + * ge->load.weight = ------------------------------ (3)
> + * tg->load_avg
> + *
> + * Where: tg->load_avg ~= \Sum grq->avg.load_avg
> + *
> + * That is shares_avg, and it is right (given the approximation (2)).
> + *
> + * The problem with it is that because the average is slow -- it was designed
> + * to be exactly that of course -- this leads to transients in boundary
> + * conditions. In specific, the case where the group was idle and we start the
> + * one task. It takes time for our CPU's grq->avg.load_avg to build up,
> + * yielding bad latency etc..
> + *
> + * Now, in that special case (1) reduces to:
> + *
> + * tg->weight * grq->load.weight
> + * ge->load.weight = ----------------------------- = tg>weight (4)
> + * grp->load.weight

Should it be "grq->load.weight" in the denominator of (4)?
And "tg->weight" at the end?

> + *
> + * That is, the sum collapses because all other CPUs are idle; the UP scenario.

Shouldn't (3) collapse in the same way too in this special case? In
theory it should reduce to:

tg->weight * grq->avg.load_avg
ge->load.weight = ------------------------------
grq->avg.load_avg


But I can see many reasons why it won't happen in practice if things
aren't perfectly up-to-date. If tg->load_avg and grq->avg.load_avg in
(3) aren't in sync, or there are stale contributions to tg->load_avg
from other cpus then (3) can return anything between 0 and tg->weight.

> + *
> + * So what we do is modify our approximation (3) to approach (4) in the (near)
> + * UP case, like:
> + *
> + * ge->load.weight =
> + *
> + * tg->weight * grq->load.weight
> + * --------------------------------------------------- (5)
> + * tg->load_avg - grq->avg.load_avg + grq->load.weight
> + *
> + *
> + * And that is shares_weight and is icky. In the (near) UP case it approaches
> + * (4) while in the normal case it approaches (3). It consistently
> + * overestimates the ge->load.weight and therefore:
> + *
> + * \Sum ge->load.weight >= tg->weight
> + *
> + * hence icky!

IIUC, if grq->avg.load_avg > grq->load.weight, i.e. you have blocked
tasks, you can end up with underestimating the ge->load.weight for some
of the group entities lead to \Sum ge->load.weight < tg->weight.

Let's take a simple example:

Two cpus, one task group with three tasks in it: An always-running task
on both cpus, and an additional periodic task currently blocked on cpu 0
(contributing 512 to grq->avg.load_avg on cpu 0).

tg->weight = 1024
tg->load_avg = 2560
\Sum grq->load.weight = 2048

cpu 0 1 \Sum
grq->avg.load_avg 1536 1024
grq->load.weight 1024 1024
ge->load_weight (1) 512 512 1024 >= tg->weight
ge->load_weight (3) 614 410 1024 >= tg->weight
ge->load_weight (5) 512 410 922 < tg->weight

So with (5) we are missing 102 worth of ge->load.weight.

If (1), the instantaneous ge->load.weight, is what we want, then
ge->load.weight of cpu 1 is underestimated, if (3), shares_avg, is the
goal, then ge->load.weight of cpu 0 is underestimated.

The "missing" ge->load.weight can get much larger if the blocked task
had higher priority.

Another thing is that we are loosing a bit of the nice stability that
(3) provides if you have periodic tasks.

I'm not sure if we can do better than (5), I'm just trying to understand
how the approximation will behave and make sure we understand the
implications.

Morten