Re: [PATCH v3 6/7] sched/fair: Filter out locally-unsolvable misfit imbalances

From: Vincent Guittot
Date: Fri Mar 19 2021 - 11:20:37 EST


On Mon, 15 Mar 2021 at 20:18, Valentin Schneider
<valentin.schneider@xxxxxxx> wrote:
>
> On 15/03/21 16:13, Vincent Guittot wrote:
> > On Thu, 11 Mar 2021 at 13:05, Valentin Schneider
> > <valentin.schneider@xxxxxxx> wrote:
> >>
> >> Consider the following (hypothetical) asymmetric CPU capacity topology,
> >> with some amount of capacity pressure (RT | DL | IRQ | thermal):
> >>
> >>   DIE [          ]
> >>   MC  [    ][    ]
> >>        0  1  2  3
> >>
> >>   | CPU | capacity_orig | capacity |
> >>   |-----+---------------+----------|
> >>   |   0 |           870 |      860 |
> >>   |   1 |           870 |      600 |
> >>   |   2 |          1024 |      850 |
> >>   |   3 |          1024 |      860 |
> >>
> >> If CPU1 has a misfit task, then CPU0, CPU2 and CPU3 are valid candidates to
> >> grant the task an uplift in CPU capacity. Consider CPU0 and CPU3 as
> >> sufficiently busy, i.e. don't have enough spare capacity to accommodate
> >> CPU1's misfit task. This would then fall on CPU2 to pull the task.
> >>
> >> This currently won't happen, because CPU2 will fail
> >>
> >> capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
> >
> > which was introduced by the previous patch (patch 5)
> >
> >>
> >> in update_sd_pick_busiest(), where 'sg' is the [0, 1] group at DIE
> >> level. In this case, the max_capacity is that of CPU0's, which is at this
> >> point in time greater than that of CPU2's. This comparison doesn't make
> >> much sense, given that the only CPUs we should care about in this scenario
> >> are CPU1 (the CPU with the misfit task) and CPU2 (the load-balance
> >> destination CPU).
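> >>
> >> For the record, with the capacities above the failing check evaluates as
> >>
> >>   capacity_greater(capacity_of(CPU2), sg->sgc->max_capacity)
> >>     == capacity_greater(850, 860)
> >>
> >> which is false whatever margin capacity_greater() applies, since 850 < 860.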
> >>
> >> Aggregate a misfit task's load into sgs->group_misfit_task_load only if
> >> env->dst_cpu would grant it a capacity uplift. Separately track whether a
> >> sched_group contains a misfit task to still classify it as
> >> group_misfit_task and not pick it as busiest group when pulling from a
> >
> > Could you give more details about why we should keep tracking the
> > group as misfit? Do you have a use case in mind?
> >
>
> As stated, the current behaviour is to classify groups as group_misfit_task
> regardless of the dst_cpu's capacity. When we see a group_misfit_task
> candidate group with higher per-CPU capacity than the local group, we don't
> pick it as busiest.
>
> I initially thought not marking those as group_misfit_task was the right
> thing to do, as they could then be classified as group_fully_busy or
> group_has_spare. Consider:
>
>   DIE [          ]
>   MC  [    ][    ]
>        0  1  2  3
>        L  L  B  B
>
> arch_scale_capacity(L) < arch_scale_capacity(B)
>
> CPUs 0-1 are idle / lightly loaded
> CPU2 has a misfit task and a few very small tasks
> CPU3 has a few very small tasks
>
> When CPU0 is running load_balance() at DIE level, right now we'll classify
> the [2-3] group as group_misfit_task and not pick it as busiest because the
> local group has a lower CPU capacity.
>
> If we didn't do that, we could leave the misfit task alone and pull some
> small task(s) from CPU2 or CPU3, which would be a good thing to

Are you sure? The last check in update_sd_pick_busiest() should already
filter this, so it should be enough to let the group be classified
correctly.

A group should be classified as group_misfit_task when it has a task that
should be migrated in priority over tasks from other groups. In your case,
you tag it as group_misfit_task but in order to do the opposite, i.e. to
make sure it is not selected. As mentioned above, this is already filtered
by the last check in update_sd_pick_busiest().
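
Something like this (as it reads with patch 5's capacity_greater() applied):

	/*
	 * Candidate sg has no more than one task per CPU and has higher
	 * per-CPU capacity. Migrating tasks to less capable CPUs may harm
	 * throughput. Maximize throughput, power/energy consequences are not
	 * considered.
	 */
	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
	    (sgs->group_type <= group_fully_busy) &&
	    (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
		return false;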

> do. However, by allowing a group containing a misfit task to be picked as
> the busiest group when a CPU of lower capacity is pulling, we run the risk
> of the misfit task itself being downmigrated - e.g. if we repeatedly
> increment the sd->nr_balance_failed counter and do an active balance (maybe
> because the small tasks were unfortunately cache_hot()).
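>
> (Spelling the mechanism out: once detach_tasks() repeatedly fails to move
> anything, load_balance() keeps bumping sd->nr_balance_failed, and something
> like
>
>   return unlikely(sd->nr_balance_failed > sd->cache_nice_tries + 2);
>
> in need_active_balance() eventually lets an active balance through, which
> can push the misfit task itself onto the lower-capacity dst_cpu.)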
>
> It's less than ideal, but I considered that not downmigrating misfit tasks
> was the thing to prioritize (and FWIW it also maintains the current
> behaviour).
>
>
> Another approach would be to add task utilization vs CPU capacity checks in
> detach_tasks() and need_active_balance() to prevent downmigration when
> env->imbalance_type < group_misfit_task. This may go against the busiest
> group selection heuristics however (misfit tasks could be the main
> contributors to the imbalance, but we end up not moving them).
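>
> Roughly, something like this in detach_tasks() (hypothetical sketch, just to
> illustrate the idea; reusing task_fits_capacity()):
>
>   if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
>       env->imbalance_type < group_misfit_task &&
>       !task_fits_capacity(p, capacity_of(env->dst_cpu)))
>           goto next;
>
> with an equivalent check in need_active_balance().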
>
>
> >> lower-capacity CPU (which is the current behaviour and prevents
> >> down-migration).
> >>
> >> Since find_busiest_queue() can now iterate over CPUs with a higher capacity
> >> than the local CPU's, add a capacity check there.
> >>
> >> Reviewed-by: Qais Yousef <qais.yousef@xxxxxxx>
> >> Signed-off-by: Valentin Schneider <valentin.schneider@xxxxxxx>
>
> >> @@ -8447,10 +8454,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >>  			continue;
> >>
> >>  		/* Check for a misfit task on the cpu */
> >> -		if (sd_has_asym_cpucapacity(env->sd) &&
> >> -		    sgs->group_misfit_task_load < rq->misfit_task_load) {
> >> -			sgs->group_misfit_task_load = rq->misfit_task_load;
> >> -			*sg_status |= SG_OVERLOAD;
> >> +		if (!sd_has_asym_cpucapacity(env->sd) ||
> >> +		    !rq->misfit_task_load)
> >> +			continue;
> >> +
> >> +		*sg_status |= SG_OVERLOAD;
> >> +		sgs->group_has_misfit_task = true;
> >> +
> >> +		/*
> >> +		 * Don't attempt to maximize load for misfit tasks that can't be
> >> +		 * granted a CPU capacity uplift.
> >> +		 */
> >> +		if (cpu_capacity_greater(env->dst_cpu, i)) {
> >> +			sgs->group_misfit_task_load = max(
> >> +				sgs->group_misfit_task_load,
> >> +				rq->misfit_task_load);
> >
> > Please encapsulate all this misfit-specific code in a dedicated
> > function which will be called from update_sg_lb_stats()
> >
>
> Will do.
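
Something like the below would do (untested sketch, name and exact signature
up to you); update_sg_lb_stats() would then just call it from its loop:

static inline void update_sg_misfit_stats(struct lb_env *env, struct rq *rq,
					  struct sg_lb_stats *sgs,
					  int *sg_status, int cpu)
{
	if (!sd_has_asym_cpucapacity(env->sd) || !rq->misfit_task_load)
		return;

	*sg_status |= SG_OVERLOAD;
	sgs->group_has_misfit_task = true;

	/*
	 * Don't attempt to maximize load for misfit tasks that can't be
	 * granted a CPU capacity uplift.
	 */
	if (cpu_capacity_greater(env->dst_cpu, cpu))
		sgs->group_misfit_task_load = max(sgs->group_misfit_task_load,
						  rq->misfit_task_load);
}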