Re: [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
From: Vincent Guittot
Date: Tue Apr 25 2017 - 04:36:32 EST
Hi Tejun,
On 24 April 2017 at 22:14, Tejun Heo <tj@xxxxxxxxxx> wrote:
> 09a43ace1f98 ("sched/fair: Propagate load during synchronous
> attach/detach") added immediate load propagation from cfs_rq to its
> sched_entity then to the parent cfs_rq; however, what gets propagated
> doesn't seem to make sense.
>
> It repeats the tg_weight calculation done in calc_cfs_shares() but
> only uses it to compensate for shares being out of date. After that,
> it sets the sched_entity's load_avg to the load_avg of the
> corresponding cfs_rq.
>
> This doesn't make sense as the cfs_rq's load_avg is some fraction of
> its total weight, which the sched_entity's weight has nothing to do
> with. For example, if the cfs_rq has a single task with a constant
> load of 1, the
> cfs_rq's load_avg would be around 1. If that cfs_rq is the only
> active sched_entity in the parent cfs_rq which has the maximum weight,
> the sched_entity's load should be around the maximum weight but
> update_tg_cfs_load() ends up overriding it to 1.
I'm not sure I follow your example:
a task TA with load_avg = 1 is the only task in a task group GB, so
the cfs_rq's load_avg = 1 too, and the group_entity of this cfs_rq
has a weight of 1024 (I use the 10-bit format for readability), which
is the total shares of task group GB.
Are you saying that the group_entity's load_avg should be around 1024 and not 1?
I would say it depends on TA's weight. I assume that TA's weight is
the default value (1024), as you don't specify any value in your
example.
If TA ran directly at the parent level, its sched_entity would have a
load_avg of 1, so why should the group entity's load_avg be 1024? It
would just temporarily show the cfs_rq as more loaded than it really
is, and in the end the group entity's load_avg would go back to 1.
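To make sure we are talking about the same numbers, here is a quick
user-space sketch (plain C, not the kernel code; the helper names are
mine and the values are only illustrative) of what each version of
update_tg_cfs_load() would propagate under the two possible readings
of your example:

#include <stdio.h>

/*
 * Simplified standalone model of the two propagation formulas.
 * propagate_old() models the current code (scale load_avg only when
 * the tg-wide load exceeds the shares); propagate_new() models the
 * min() hunk further down in your patch.
 */
static long propagate_old(long load_avg, long tg_load, long shares)
{
	if (tg_load > shares)
		return load_avg * shares / tg_load;
	return load_avg;
}

static long propagate_new(long runnable, long weight, long shares)
{
	long load = runnable * shares / weight;

	return load < shares ? load : shares;	/* min(load, shares) */
}

int main(void)
{
	/* my reading: TA has the default weight 1024 and is mostly
	 * idle, so runnable_load_avg = load_avg = 1, tg_load ~= 1 */
	printf("default-weight TA: old=%ld new=%ld\n",
	       propagate_old(1, 1, 1024),
	       propagate_new(1, 1024, 1024));

	/* your reading, if I get it right: TA itself has weight 1,
	 * so the group cfs_rq's load.weight is 1 as well */
	printf("weight-1 TA:       old=%ld new=%ld\n",
	       propagate_old(1, 1, 1024),
	       propagate_new(1, 1, 1024));

	return 0;
}

With the default weight both versions propagate 1; only if TA's
weight is 1 does the new formula scale the propagated load up to the
full 1024 shares.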
>
> At the parent's level, the absolute value of load_avg inside a child
> cfs_rq doesn't mean anything. Only the ratio against its weight is
> meaningful.
>
> This patch changes update_tg_cfs_load() to normalize the
> runnable_load_avg of the cfs_rq and then scale it to the matching
> sched_entity's freshly calculated shares for propagation. Use of
> runnable_load_avg instead of load_avg is intentional and keeps the
> parent's runnable_load_avg true to the sum of scaled loads of all
> tasks queued under it, which is critical for the correct operation
> of the load balancer. The next patch will depend on it.
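If I read the new hunk below correctly, the propagated value reduces
to (my notation, not from the patch):

    se_load = min(runnable_load_avg * shares / load.weight, shares)

So, to take made-up numbers, a child cfs_rq with runnable_load_avg =
512, load.weight = 1024 and freshly computed shares = 2048 would
propagate 512 * 2048 / 1024 = 1024, i.e. half of the entity's shares,
which matches the 50% load-to-weight ratio of the child that you say
is the only meaningful quantity.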
>
> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Mike Galbraith <efault@xxxxxx>
> Cc: Paul Turner <pjt@xxxxxxxxxx>
> ---
> kernel/sched/fair.c | 46 +++++++++++++++++++---------------------------
> 1 file changed, 19 insertions(+), 27 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3078,37 +3078,29 @@ static inline void
>  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
> -	long delta, load = gcfs_rq->avg.load_avg;
> +	long load = 0, delta;
> 
>  	/*
> -	 * If the load of group cfs_rq is null, the load of the
> -	 * sched_entity will also be null so we can skip the formula
> +	 * A cfs_rq's load avg contribution to the parent should be scaled
> +	 * to the sched_entity's weight. Use freshly calculated shares
> +	 * instead of @se->load.weight as the latter may not reflect
> +	 * changes from the current scheduling operation.
> +	 *
> +	 * Note that the propagation source is runnable_load_avg instead of
> +	 * load_avg. This keeps every cfs_rq's runnable_load_avg true to
> +	 * the sum of the scaled loads of all tasks queued under it, which
> +	 * is important for the correct operation of the load balancer.
> +	 *
> +	 * This can make the sched_entity's load_avg jumpier but that
> +	 * correctly reflects what would happen without cgroups if each
> +	 * task's load is scaled across nesting - the load is being
> +	 * averaged at the task and each cfs_rq.
>  	 */
> -	if (load) {
> -		long tg_load;
> +	if (gcfs_rq->load.weight) {
> +		long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
> 
> -		/* Get tg's load and ensure tg_load > 0 */
> -		tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
> -
> -		/* Ensure tg_load >= load and updated with current load*/
> -		tg_load -= gcfs_rq->tg_load_avg_contrib;
> -		tg_load += load;
> -
> -		/*
> -		 * We need to compute a correction term in the case that the
> -		 * task group is consuming more CPU than a task of equal
> -		 * weight. A task with a weight equals to tg->shares will have
> -		 * a load less or equal to scale_load_down(tg->shares).
> -		 * Similarly, the sched_entities that represent the task group
> -		 * at parent level, can't have a load higher than
> -		 * scale_load_down(tg->shares). And the Sum of sched_entities'
> -		 * load must be <= scale_load_down(tg->shares).
> -		 */
> -		if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
> -			/* scale gcfs_rq's load into tg's shares*/
> -			load *= scale_load_down(gcfs_rq->tg->shares);
> -			load /= tg_load;
> -		}
> +		load = min(gcfs_rq->runnable_load_avg *
> +			   shares / gcfs_rq->load.weight, shares);
>  	}
> 
>  	delta = load - se->avg.load_avg;
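One more note on the tail of the function (unchanged by your patch,
if I remember the rest of it correctly): delta is then written into
se->avg.load_avg and added to the parent cfs_rq's load_avg, so the
propagated load moves one level up in a single step. In the weight-1
reading from my sketch above, that would be a jump of 1024 - 1 = 1023
on the parent at once, which I guess is the "jumpier" behaviour the
new comment mentions.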