Re: [PATCH v9 2/4] sched: Rewrite runnable load and utilization average tracking

From: Dietmar Eggemann
Date: Mon Jul 13 2015 - 13:08:47 EST


Hi Yuyang,

I did some testing of your new pelt implementation.

TC 1: one nice-0 60% task affine to cpu1 in root tg and 2 nice-0 20%
periodic tasks affine to cpu1 in a task group with id=3 (one hierarchy).

TC 2: 10 nice-0 5% tasks affine to cpu1 in a task group with id=3 (one
hierarchy).

and compared the results with the current pelt implementation, i.e. the
pelt signals of the se's (the tasks and the tg representation for cpu1)
as well as the cfs_rq- and tg-related signals.

The signals are very similar, taking into consideration the differences
due to the separated/missing blocked load/util in the current pelt and
the slightly different behaviour in transitional phases (e.g. task
enqueue/dequeue).
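
In case somebody wants to reproduce the setup: each periodic task was
essentially of the following shape (a user-space sketch, not my actual
test rig; the busy/sleep split approximates the duty cycle, e.g. 5ms
busy/95ms sleep for the 5% tasks, and the task-group placement was done
separately via the cpu cgroup controller):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <time.h>

  static long long ns_since(const struct timespec *t0)
  {
          struct timespec t;

          clock_gettime(CLOCK_MONOTONIC, &t);
          return (t.tv_sec - t0->tv_sec) * 1000000000LL +
                 (t.tv_nsec - t0->tv_nsec);
  }

  int main(void)
  {
          struct timespec sleep_ts = { 0, 95 * 1000 * 1000 }; /* 95ms */
          cpu_set_t set;

          CPU_ZERO(&set);
          CPU_SET(1, &set);                        /* affine to cpu1 */
          sched_setaffinity(0, sizeof(set), &set);

          for (;;) {
                  struct timespec start;

                  clock_gettime(CLOCK_MONOTONIC, &start);
                  while (ns_since(&start) < 5 * 1000 * 1000)
                          ;                        /* ~5ms busy */
                  nanosleep(&sleep_ts, NULL);      /* ~95ms idle */
          }
          return 0;
  }

The 60%/20% tasks only differ in the busy/sleep ratio.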

I haven't done any performance related tests yet.

-- Dietmar

On 23/06/15 01:08, Yuyang Du wrote:
> The idea of runnable load average (let runnable time contribute to weight)
> was proposed by Paul Turner, and it is still followed by this rewrite. This
> rewrite aims to solve the following issues:

[...]

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index af0eeba..8b4bc4f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1183,29 +1183,23 @@ struct load_weight {
> u32 inv_weight;
> };
>
> +/*
> + * The load_avg/util_avg represents an infinite geometric series:
> + * 1) load_avg describes the amount of time that a sched_entity
> + * is runnable on a rq. It is based on both load_sum and the
> + * weight of the task.
> + * 2) util_avg describes the amount of time that a sched_entity
> + * is running on a CPU. It is based on util_sum and is scaled
> + * in the range [0..SCHED_LOAD_SCALE].

sa->load_[avg/sum] and sa->util_[avg/sum] are also used for the
aggregated load/util values on the cfs_rq's.

> + * The 64 bit load_sum can:
> + * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
> + * the highest weight (=88761) always runnable, we should not overflow
> + * 2) for entity, support any load.weight always runnable
> + */
> struct sched_avg {
> - u64 last_runnable_update;
> - s64 decay_count;
> - /*
> - * utilization_avg_contrib describes the amount of time that a
> - * sched_entity is running on a CPU. It is based on running_avg_sum
> - * and is scaled in the range [0..SCHED_LOAD_SCALE].
> - * load_avg_contrib described the amount of time that a sched_entity
> - * is runnable on a rq. It is based on both runnable_avg_sum and the
> - * weight of the task.
> - */
> - unsigned long load_avg_contrib, utilization_avg_contrib;
> - /*
> - * These sums represent an infinite geometric series and so are bound
> - * above by 1024/(1-y). Thus we only need a u32 to store them for all
> - * choices of y < 1-2^(-32)*1024.
> - * running_avg_sum reflects the time that the sched_entity is
> - * effectively running on the CPU.
> - * runnable_avg_sum represents the amount of time a sched_entity is on
> - * a runqueue which includes the running time that is monitored by
> - * running_avg_sum.
> - */
> - u32 runnable_avg_sum, avg_period, running_avg_sum;
> + u64 last_update_time, load_sum;
> + u32 util_sum, period_contrib;
> + unsigned long load_avg, util_avg;
> };
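
By the way, the bound in 1) checks out: load_sum is bounded by
weight * LOAD_AVG_MAX, with LOAD_AVG_MAX = 47742 being the maximum the
decayed 1024us contributions can sum up to (y^32 = 1/2). A quick
user-space check (my sketch, not kernel code):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          /*
           * load_sum' = y^p * load_sum + weight * d, with y^32 = 1/2
           * and d the runnable time in the new 1024us periods, is
           * bounded by weight * LOAD_AVG_MAX.
           */
          const uint64_t LOAD_AVG_MAX = 47742;
          const uint64_t MAX_WEIGHT = 88761;   /* prio_to_weight[0] */

          /* always-runnable max-weight entities before u64 overflow */
          printf("%llu\n", (unsigned long long)
                 (UINT64_MAX / LOAD_AVG_MAX / MAX_WEIGHT));
          /* prints 4353082796, matching the comment above */
          return 0;
  }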

[...]

> /*
> - * Aggregate cfs_rq runnable averages into an equivalent task_group
> - * representation for computing load contributions.
> + * Updating tg's load_avg is necessary before update_cfs_share (which is done)
> + * and effective_load (which is not done because it is too costly).
> */
> -static inline void __update_tg_runnable_avg(struct sched_avg *sa,
> - struct cfs_rq *cfs_rq)
> +static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
> {

This function is always called with force=0? I remember there was some
discussion about this in your v5 (error bounds of '/ 64'), but since
the force path is not used ...

> - struct task_group *tg = cfs_rq->tg;
> - long contrib;
> -
> - /* The fraction of a cpu used by this cfs_rq */
> - contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
> - sa->avg_period + 1);
> - contrib -= cfs_rq->tg_runnable_contrib;
> + long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>
> - if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
> - atomic_add(contrib, &tg->runnable_avg);
> - cfs_rq->tg_runnable_contrib += contrib;
> - }
> -}
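
If I stitch the hunk back together, the new function presumably ends up
like this (reconstructed from the diff above, so just a sketch; the
atomic_long type of tg->load_avg is my assumption):

  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
  {
          long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

          /* only touch the tg-global atomic for a delta > ~1.5% (1/64) */
          if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
                  atomic_long_add(delta, &cfs_rq->tg->load_avg);
                  cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
          }
  }

So with force=0 everywhere, the '/ 64' threshold is the only thing
keeping small deltas out of tg->load_avg.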

[...]

> -static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
> -
> -/* Update a sched_entity's runnable average */
> -static inline void update_entity_load_avg(struct sched_entity *se,
> - int update_cfs_rq)
> +/* Update task and its cfs_rq load average */
> +static inline void update_load_avg(struct sched_entity *se, int update_tg)
> {
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
> - long contrib_delta, utilization_delta;
> int cpu = cpu_of(rq_of(cfs_rq));
> - u64 now;
> + u64 now = cfs_rq_clock_task(cfs_rq);
>
> /*
> - * For a group entity we need to use their owned cfs_rq_clock_task() in
> - * case they are the parent of a throttled hierarchy.
> + * Track task load average for carrying it to new CPU after migrated, and
> + * track group sched_entity load average for task_h_load calc in migration
> */
> - if (entity_is_task(se))
> - now = cfs_rq_clock_task(cfs_rq);
> - else
> - now = cfs_rq_clock_task(group_cfs_rq(se));

Why do you no longer distinguish between se's representing tasks and
se's representing task groups when getting 'now'? The old code used the
group se's owned cfs_rq_clock_task() in case it is the parent of a
throttled hierarchy.

> + __update_load_avg(now, cpu, &se->avg,
> + se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se);
>
> - if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
> - cfs_rq->curr == se))
> - return;
> -
> - contrib_delta = __update_entity_load_avg_contrib(se);
> - utilization_delta = __update_entity_utilization_avg_contrib(se);
> -
> - if (!update_cfs_rq)
> - return;
> -
> - if (se->on_rq) {
> - cfs_rq->runnable_load_avg += contrib_delta;
> - cfs_rq->utilization_load_avg += utilization_delta;
> - } else {
> - subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
> - }
> + if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
> + update_tg_load_avg(cfs_rq, 0);
> }

[...]

> -
> static void update_blocked_averages(int cpu)

The name of this function is now misleading since you don't update
blocked averages any more. The existing pelt implementation calls
__update_blocked_averages_cpu() -> update_cfs_rq_blocked_load() ->
subtract_blocked_load_contrib() for the whole tg tree.

Whereas you update cfs_rq->avg.[load|util]_[avg|sum] and conditionally
tg->load_avg and cfs_rq->tg_load_avg_contrib.
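
Side by side (my summary of the two code paths):

  old: update_blocked_averages()
         -> __update_blocked_averages_cpu()
            -> update_cfs_rq_blocked_load()
               -> subtract_blocked_load_contrib()

  new: update_blocked_averages()
         -> update_cfs_rq_load_avg()  /* cfs_rq->avg.[load|util]_[avg|sum] */
         -> update_tg_load_avg()      /* tg->load_avg, tg_load_avg_contrib */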

[...]
