Re: [PATCH 03/10] sched,fair: redefine runnable_load_avg as the sum of task_h_load

From: Josef Bacik
Date: Mon Jul 01 2019 - 12:29:55 EST


On Fri, Jun 28, 2019 at 04:49:06PM -0400, Rik van Riel wrote:
> The runnable_load magic is used to quickly propagate information about
> runnable tasks up the hierarchy of runqueues. The runnable_load_avg is
> mostly used for the load balancing code, which only examines the value at
> the root cfs_rq.
>
> Redefine the root cfs_rq runnable_load_avg to be the sum of task_h_loads
> of the runnable tasks. This works because the hierarchical runnable_load of
> a task is already equal to the task_se_h_load today. This provides enough
> information to the load balancer.
>
> The runnable_load_avg of the cgroup cfs_rqs does not appear to be
> used for anything, so don't bother calculating those.
>
> This removes one of the things that the code currently traverses the
> cgroup hierarchy for, and getting rid of it brings us one step closer
> to a flat runqueue for the CPU controller.
>

My memory on this stuff is very hazy, but IIRC we had the runnable_sum and the
runnable_avg separated out because you could have the avg lag behind the sum.
So like you enqueue a bunch of new entities who's avg may have decayed a bunch
and so their overall load is not felt on the CPU until they start running, and
now you've overloaded that CPU. The sum was there to make sure new things
coming onto the CPU added actual load to the queue instead of looking like there
was no load.

Is this going to be a problem now with this new code?

<snip>

> +static inline void
> +update_runnable_load_avg(struct sched_entity *se)
> +{
> + struct cfs_rq *root_cfs_rq = &cfs_rq_of(se)->rq->cfs;
> + long new_h_load, delta;
> +
> + SCHED_WARN_ON(!entity_is_task(se));
> +
> + if (!se->on_rq)
> + return;
>
> - sub_positive(&cfs_rq->avg.runnable_load_avg, se->avg.runnable_load_avg);
> - sub_positive(&cfs_rq->avg.runnable_load_sum,
> - se_runnable(se) * se->avg.runnable_load_sum);
> + new_h_load = task_se_h_load(se);
> + delta = new_h_load - se->enqueued_h_load;
> + root_cfs_rq->avg.runnable_load_avg += delta;

Should we use add_positive here? Thanks,

Josef