Re: [RFC PATCH] sched: reflect sched_entity movement into task_group's utilization

From: Dietmar Eggemann
Date: Wed May 11 2016 - 10:45:51 EST


Hi Vincent,

On 04/05/16 08:17, Vincent Guittot wrote:
> Ensure that changes of the utilization of a sched_entity will be
> reflected in the task_group hierarchy down to the root cfs.
>
> This patch tries another way than the flat utilization hierarchy proposal to
> ensure that the changes will be propagated down to the root cfs.

IMHO, the biggest advantage over the flat utilization hierarchy approach
is that you don't have to sync the sched_avg::last_update_time values
between se's and cfs_rq's.

> The way to compute the sched average metrics stays the same so the utilization
> only need to be synced with the local cfs rq timestamp.
>
> We keep an estimate of the utilization of each task group which is not used
> for now but might be usefull in the future even if i don't have idea so far.

A simple test case (a 50% task (run/period 8/16ms) switching between 2
cpus every 160ms (due to restricted cpu affinity to one of these cpus)
and running in tg_root/tg_1) shows the aimed behaviour at the root
cfs_rq (immediate utilization drop from ~550 to 0 on the src cpu and
immediate utilization rise from 0 to ~550 on the dst cpu in case of a
migration from src to dst cpu.

But in case I run the same task in tg_root/tg_1/tg_11 , then the
utilization of the root cfs_rq looks like the one of a system w/o the
patch. The problem seems to be that cfs_rq->diff_util_avg (owned by
tg_root/tg_1 on cpuX) is not 0 (instead it has ~ -470) in case the task
gets enqueued on cpuX. So the utilization signal of root cfs_rq ramps up
slowly and doesn't drop in case the task migrates to the other cpu.

[...]

> +/*
> + * Save how much utilization has just been added/removed on cfs rq so we can
> + * propagate it across the whole tg tree
> + */
> +static void set_tg_cfs_util(struct cfs_rq *cfs_rq, int delta)

Maybe s/cfs_cfs_rq ?

> +{
> + if (!cfs_rq->tg)

I guess here you want to bail if cfs_rq is the root cfs_rq. The current
condition will always be true if CONFIG_FAIR_GROUP_SCHED is set. For the
root cfs_rq cfs->tg is equal to &root_task_group.

Since you don't need diff_util_avg on the root cfs_rq, you could use
cfs_rq->tg == &root_task_group or &rq->cfs == cfs_rq or
!cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))]

It doesn't harm functionality though, it's just the fact that you update
cfs_rq->diff_util_avg for the root cfs_rq needlessly.

> + return;
> +
> + cfs_rq->diff_util_avg += delta;
> +}
> +
> +/*
> + * Look at pending utilization change in the cfs rq and reflect it in the
> + * sched_entity that represents the cfs rq at parent level

Not sure if I understand the 'parent level' here correctly. For me, this
se, the cfs_rq it 'owns' (se->my_q or group_cfs_rq(se)) and the tg
(cfs_rq->tg) are at the same level.

se->parent, se->cfs_rq (or cfs_rq_of(se)), cfs_rq->tg->parent would be
the parent level.

So for me update_tg_se_util() operates in one level and the update of
the parent level would happen in the caller functon, e.g.
attach_entity_load_avg().

> + */
> +static int update_tg_se_util(struct sched_entity *se)
> +{
> + struct cfs_rq *cfs_rq;
> + long update_util_avg;
> + long old_util_avg;
> +
> + if (entity_is_task(se))
> + return 0;
> +
> + cfs_rq = se->my_q;

Minor thing, Couldn't you use group_cfs_rq(se) here?

> +
> + update_util_avg = cfs_rq->diff_util_avg;
> +
> + if (!update_util_avg)
> + return 0;
> +
> + cfs_rq->diff_util_avg = 0;
> +
> + old_util_avg = se->avg.util_avg;
> + se->avg.util_avg = max_t(long, se->avg.util_avg + update_util_avg, 0);
> + se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
> +
> + return se->avg.util_avg - old_util_avg;
> +}
> +
> +
> +/* Take into account the change of the utilization of a child task group */
> +static void update_tg_cfs_util(struct sched_entity *se)
> +{
> + int delta;
> + struct cfs_rq *cfs_rq;
> +
> + if (!se)
> + return;

This condition is only necessary for the call from
update_blocked_averages() for a root cfs_rq, I guess?

[...]

> @@ -6260,6 +6348,8 @@ static void update_blocked_averages(int cpu)
>
> if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
> update_tg_load_avg(cfs_rq, 0);
> + /* Propagate pending util changes to the parent */
> + update_tg_cfs_util(cfs_rq->tg->se[cpu_of(rq)]);

Couldn't cpu_of(rq)] not just be cpu?

[...]