Re: [PATCH 1/3] sched/fair: Peter's shares_type patch
From: Peter Zijlstra
Date: Fri May 05 2017 - 11:41:40 EST
Hi all,
Sorry for the overly long post, this is a semi coherent collection
of all the notes I found on my desk and some continuations thereon.
On Fri, May 05, 2017 at 12:40:27PM +0200, Vincent Guittot wrote:
> > +enum shares_type {
> > + shares_runnable,
> > + shares_avg,
> > + shares_weight,
> > +};
> > +
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> > # ifdef CONFIG_SMP
> > +static long
> > +calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg,
> > + enum shares_type shares_type)
> > {
> > + long tg_weight, tg_shares, load, shares;
> >
> > + tg_shares = READ_ONCE(tg->shares);
> > +
> > + switch (shares_type) {
> > + case shares_runnable:
> > + /*
> > + * Instead of the correct cfs_rq->avg.load_avg we use
> > + * cfs_rq->runnable_load_avg, which does not include the
> > + * blocked load.
> > + */
> > + load = cfs_rq->runnable_load_avg;
> > + break;
> > +
> > + case shares_avg:
> > + load = cfs_rq->avg.load_avg;
> > + break;
> > +
> > + case shares_weight:
> > + /*
> > + * Instead of the correct cfs_rq->avg.load_avg we use
> > + * cfs_rq->load.weight, which is its upper bound. This helps
> > + * ramp up the shares for small weight interactive tasks.
> > + */
> > + load = scale_load_down(cfs_rq->load.weight);
> > + break;
> > + }
> >
> > tg_weight = atomic_long_read(&tg->load_avg);
> >
> > + /*
> > + * This ensures the sum is up-to-date for this CPU, in case of the other
> > + * two approximations it biases the sum towards their value and in case
> > + * of (near) UP ensures the division ends up <= 1.
> > + */
> > tg_weight -= cfs_rq->tg_load_avg_contrib;
> > tg_weight += load;
>
> The above is correct for shares_avg and shares_weight but it's not
> correct for shares_runnable because runnable_load_avg is a subset of
> load_avg.
Its only correct for shares_avg. shares_weight and shares_runnable are
both icky.
Remember, all this does is approximate the hierarchical proportion which
includes that global sum we all love to hate.
That is, the weight of a group entity, is the proportional share of the
group weight based on the group runqueue weights. That is:
tg->weight * grq->load.weight
ge->load.weight = ----------------------------- (1)
\Sum grq->load.weight
Now, because computing that sum is prohibitively expensive to compute
(been there, done that) we approximate it with this average stuff. The
average moves slower and therefore the approximation is cheaper and
more stable.
So instead of the above, we substitute:
grq->load.weight -> grq->avg.load_avg (2)
which yields the following:
tg->weight * grq->avg.load_avg
ge->load.weight = ------------------------------ (3)
tg->load_avg
Because: tg->load_avg ~= \Sum grq->avg.load_avg
That is shares_avg, and it is right (given the approximation (2)).
The problem with it is that because the average is slow -- it was
designed to be exactly that of course -- this leads to transients in
boundary conditions. In specific, the case where the group was idle and
we start the one task. It takes time for our CPU's grq->avg.load_avg to
build up, yielding bad latency etc..
Now, in that special case (1) reduces to:
tg->weight * grq->load.weight
ge->load.weight = ----------------------------- = tg>weight (4)
grp->load.weight
That is, the sum collapses because all other CPUs are idle; the UP
scenario.
So what we do is modify our approximation (3) to approach (4) in the
(near) UP case, like:
ge->load.weight =
tg->weight * grq->load.weight
--------------------------------------------------- (5)
tg->load_avg - grq->avg.load_avg + grq->load.weight
And that is shares_weight and is icky. In the (near) UP case it
approaches (4) while in the normal case it approaches, but never
quite reaches, (3). I consistently overestimates the ge->load.weight
and therefore:
\Sum ge->load.weight >= tg->weight
hence icky!
Now, on to the propagate stuff.
I'm still struggling with this, so I might have gotten it completely
backwards. Bear with me.
> And i think that shares_avg is also wrong and should be something like:
>
> group_entity->avg.load_avg =
> cfs_rq->avg.load_avg * group_entity->load.weight / cfs_rq->load.weight
So firstly, no, shares_avg was intended to be (3) and that is right.
What you're looking for is an expression for ge->avg.load_avg, that's an
entirely different beast.
But lets take a few steps back.
The whole point of the propagation thing is that:
ge->avg == grq->avg (6)
_IFF_ we look at the pure running and runnable sums. Because they
represent the very same entity, just as different views.
But since we've just added an entity to grq->avg, we must also update
ge->avg to retain our invariant (6).
Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
simply copies the running sum over.
However, update_tg_cfs_load() is making my head hurt. So we have:
ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg (7)
And since, like util, the runnable part should be directly transferable,
the following would _appear_ to be the straight forward approach:
grq->avg.load_avg = grq->load.weight * grq->avg.running_avg (8)
And per (6) we have:
ge->avg.running_avg == grq->avg.running_avg
Which gives (6,7,8):
ge->load.weight * grq->avg.load_avg
ge->avg.load_avg = ----------------------------------- (9)
grq->load.weight
Except that is wrong!
Because while for entities historical weight is not important and we
really only care about our future and therefore can consider a pure
runnable sum, runqueues can NOT do this.
We specifically want runqueues to have a load_avg that includes
historical weights. Those represent the blocked load, the load we expect
to (shortly) return to us. This only works by keeping the weights as
integral part of the sum. We therefore cannot decompose as per (8).
OK, so what then?
The code (more or less) does:
tg->weight * grq->avg.load_avg
ge->avg.load_avg = ------------------------------
tg->load_avg
Which is basically (3), but since that was the right expression for
ge->load.weight, and per (7) that does not preserve the runnable sum,
its an upper bound and basically completely destroys the runnable part.
[
however, since we actually use (5), and (5) > (3) we get an effective
ge->avg.runnable_avg < 1. But still, afaict there's no relation to
the desired grq->avg.runnable_avg part.
]
You now propose (9), which as I've explained is wrong (also, trying to
implement it doesn't work, grq->load.weight will be 0 etc..).
Another way to look at things is:
grq->avg.load_avg = \Sum se->avg.load_avg
Therefore, per (7):
grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
And the very thing we're propagating is a change in that sum (someone
joined/left). So we can easily know the runnable change, if we were to
keep track of cfs_rq->removed_runnable_avg and
cfs_rq->added_runnable_avg for the migrate condition, which would be,
per (7) the already tracked se->load_avg divided by the corresponding
se->weight.
Basically (9) but in differential form:
d(runnable_avg) += se->avg.load_avg / se->load.weight
(10)
ge->avg.load_avg += ge->load.weight * d(runnable_avg)
OK, on to the whole runnable thing, which we should not confuse with the
propagation thing, it has very little to do with the invariant (6).
> In fact, if TA is the only task in the cgroup,
> tg->load_avg == cfs_rq->tg_load_avg_contrib.
> For shares_avg,
> tg_weight == cfs_rq->runnable_load_avg , and
> shares = cfs_rq->runnable_load_avg * tg->shares / cfs_rq->runnable_load_avg
> = tg->shares = 1024.
That was the intent. (Also, _please_, use whitespace, its cheap
and makes things _so_ much more readable).
grq->runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq
So its a subset of grq->avg.load_avg. Now, since each group entity
(can) represents multiple tasks, not all of which will be runnable at
the same time, our group entity should not contribute its entire
load_avg to its parent cfs_rq's runnable.
Here I think we can use a simple proportion:
grq->runnable_load_avg
ge->runnable_avg = ge->avg.load_avg * ----------------------
grq->avg.load_avg
Its a local property, afaict.
And since it needs to be updated on every enqueue/dequeue, we should do
it in {en,de}queue_entity_load_avg() or thereabout.
I've not thought much about time how this interacts with the regular
PELT() function and if this fraction is retained under it.
But at this point my brain is completely fried..
> For shares_runnable, it should be
>
> group_entity->runnable_load_avg =
> cfs_rq->runnable_load_avg * group_entity->avg.load_avg / cfs_rq->avg.load_avg
I agree with the equation.
> > @@ -3078,39 +3105,10 @@ static inline void
> > update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > {
> > struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> > - long delta, load = gcfs_rq->avg.load_avg;
> > -
> > - /*
> > - * If the load of group cfs_rq is null, the load of the
> > - * sched_entity will also be null so we can skip the formula
> > - */
> > - if (load) {
> > - long tg_load;
> > -
> > - /* Get tg's load and ensure tg_load > 0 */
> > - tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> > -
> > - /* Ensure tg_load >= load and updated with current load*/
> > - tg_load -= gcfs_rq->tg_load_avg_contrib;
> > - tg_load += load;
> > -
> > - /*
> > - * We need to compute a correction term in the case that the
> > - * task group is consuming more CPU than a task of equal
> > - * weight. A task with a weight equals to tg->shares will have
> > - * a load less or equal to scale_load_down(tg->shares).
> > - * Similarly, the sched_entities that represent the task group
> > - * at parent level, can't have a load higher than
> > - * scale_load_down(tg->shares). And the Sum of sched_entities'
> > - * load must be <= scale_load_down(tg->shares).
> > - */
> > - if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
> > - /* scale gcfs_rq's load into tg's shares*/
> > - load *= scale_load_down(gcfs_rq->tg->shares);
> > - load /= tg_load;
> > - }
> > - }
> > + long load, delta;
> >
> > + load = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg,
> > + shares_avg));
> > delta = load - se->avg.load_avg;
> >
> > /* Nothing to update */
So this part... I'm fairly sure both the old and new version is wrong.