Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

From: Morten Rasmussen
Date: Wed Nov 30 2016 - 07:49:29 EST


On Fri, Nov 25, 2016 at 04:34:33PM +0100, Vincent Guittot wrote:
> find_idlest_group() only compares the runnable_load_avg when looking for
> the least loaded group. But on fork intensive use case like hackbench
> where tasks blocked quickly after the fork, this can lead to selecting the
> same CPU instead of other CPUs, which have similar runnable load but a
> lower load_avg.
>
> When the runnable_load_avg of 2 CPUs are close, we now take into account
> the amount of blocked load as a 2nd selection factor. There is now 3 zones
> for the runnable_load of the rq:
> -[0 .. (runnable_load - imbalance)] : Select the new rq which has
> significantly less runnable_load
> -](runnable_load - imbalance) .. (runnable_load + imbalance)[ : The
> runnable load are close so we use load_avg to chose between the 2 rq
> -[(runnable_load + imbalance) .. ULONG_MAX] : Keep the current rq which
> has significantly less runnable_load
>
> For use case like hackbench, this enable the scheduler to select different
> CPUs during the fork sequence and to spread tasks across the system.
>
> Tests have been done on a Hikey board (ARM based octo cores) for several
> kernel. The result below gives min, max, avg and stdev values of 18 runs
> with each configuration.
>
> The v4.8+patches configuration also includes the changes below which is
> part of the proposal made by Peter to ensure that the clock will be up to
> date when the fork task will be attached to the rq.
>
> @@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
> __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
> #endif
> rq = __task_rq_lock(p, &rf);
> + update_rq_clock(rq);
> post_init_entity_util_avg(&p->se);
>
> activate_task(rq, p, 0);
>
> hackbench -P -g 1
>
> ea86cb4b7621 7dc603c9028e v4.8 v4.8+patches
> min 0.049 0.050 0.051 0,048
> avg 0.057 0.057(0%) 0.057(0%) 0,055(+5%)
> max 0.066 0.068 0.070 0,063
> stdev +/-9% +/-9% +/-8% +/-9%
>
> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> ---
>
> Changes since v2:
> - Rebase on latest sched/core
> - Get same results with the rebase and the fix mentioned in patch 01
>
> kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 38 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 820a787..ecb5ee8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5395,16 +5395,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> {
> struct sched_group *idlest = NULL, *group = sd->groups;
> struct sched_group *most_spare_sg = NULL;
> - unsigned long min_load = ULONG_MAX, this_load = 0;
> + unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
> + unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
> unsigned long most_spare = 0, this_spare = 0;
> int load_idx = sd->forkexec_idx;
> - int imbalance = 100 + (sd->imbalance_pct-100)/2;
> + int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
> + unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
> + (sd->imbalance_pct-100) / 100;
>
> if (sd_flag & SD_BALANCE_WAKE)
> load_idx = sd->wake_idx;
>
> do {
> - unsigned long load, avg_load, spare_cap, max_spare_cap;
> + unsigned long load, avg_load, runnable_load;
> + unsigned long spare_cap, max_spare_cap;
> int local_group;
> int i;
>
> @@ -5421,6 +5425,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> * the group containing the CPU with most spare capacity.
> */
> avg_load = 0;
> + runnable_load = 0;
> max_spare_cap = 0;
>
> for_each_cpu(i, sched_group_cpus(group)) {
> @@ -5430,7 +5435,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> else
> load = target_load(i, load_idx);
>
> - avg_load += load;
> + runnable_load += load;
> +
> + avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
>
> spare_cap = capacity_spare_wake(i, p);
>
> @@ -5439,14 +5446,32 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> }
>
> /* Adjust by relative CPU capacity of the group */
> - avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
> + avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
> + group->sgc->capacity;
> + runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
> + group->sgc->capacity;
>
> if (local_group) {
> - this_load = avg_load;
> + this_runnable_load = runnable_load;
> + this_avg_load = avg_load;
> this_spare = max_spare_cap;
> } else {
> - if (avg_load < min_load) {
> - min_load = avg_load;
> + if (min_runnable_load > (runnable_load + imbalance)) {
> + /*
> + * The runnable load is significantly smaller
> + * so we can pick this new cpu
> + */
> + min_runnable_load = runnable_load;
> + min_avg_load = avg_load;
> + idlest = group;
> + } else if ((runnable_load < (min_runnable_load + imbalance)) &&
> + (100*min_avg_load > imbalance_scale*avg_load)) {
> + /*
> + * The runnable loads are close so we take
> + * into account blocked load through avg_load
> + * which is blocked + runnable load
> + */
> + min_avg_load = avg_load;
> idlest = group;
> }
>
> @@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> goto no_spare;
>
> if (this_spare > task_util(p) / 2 &&
> - imbalance*this_spare > 100*most_spare)
> + imbalance_scale*this_spare > 100*most_spare)
> return NULL;
> else if (most_spare > task_util(p) / 2)
> return most_spare_sg;
>
> no_spare:
> - if (!idlest || 100*this_load < imbalance*min_load)
> + if (!idlest ||
> + (min_runnable_load > (this_runnable_load + imbalance)) ||
> + ((this_runnable_load < (min_runnable_load + imbalance)) &&
> + (100*min_avg_load > imbalance_scale*this_avg_load)))

I don't get why you have imbalance_scale applied to this_avg_load and
not min_avg_load. IIUC, you end up preferring non-local groups?

If we take the example where this_runnable_load == min_runnable_load and
this_avg_load == min_avg_load. In this case, and in cases where
min_avg_load is slightly bigger than this_avg_load, we end up picking
the 'idlest' group even if the local group is equally good or even
slightly better?

> return NULL;
> return idlest;
> }

Overall, I like that load_avg is being brought in to make better
decisions. The variable naming is a bit confusing. For example,
runnable_load is a capacity-average just like avg_load. 'imbalance' is
now an absolute capacity-average margin, but it is hard to come up with
better short alternatives.

Although 'imbalance' is based on the existing imbalance_pct, I find
somewhat arbitrary. Why is (imbalance_pct-100)*1024/100 a good absolute
margin to define the interval where we want to consider load_avg? I
guess it is case of 'we had to pick some value', which we have done in
many other places. Though, IMHO, it is a bit strange that imbalance_pct
is used in two different ways to bias comparison in the same function.
It used to be only used as a scaling factor (now imbalance_scale), while
this patch proposes to use it for computing an absolute margin
(imbalance) as well. It is not major issue, but it is not clear why it
is used differently to compare two metrics that are relatively closely
related.

Morten