Re: [PATCH 3/3] sched/fair: fix unnecessary increase of balance interval

From: Valentin Schneider
Date: Thu Dec 20 2018 - 06:22:50 EST


On 20/12/2018 07:55, Vincent Guittot wrote:
> In case of active balance, we increase the balance interval to cover
> pinned tasks cases not covered by all_pinned logic. Neverthless, the
> active migration triggered by asym packing should be treated as the normal
> unbalanced case and reset the interval to default value otherwise active
> migration for asym_packing can be easily delayed for hundreds of ms
> because of this pinned task detection mecanism.
> The same happen to other conditions tested in need_active_balance() like
> mistfit task and when the capacity of src_cpu is reduced compared to
> dst_cpu (see comments in need_active_balance() for details).
>
> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> ---
> kernel/sched/fair.c | 40 +++++++++++++++++++++++++++-------------
> 1 file changed, 27 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 487c73e..9b1e701 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8849,21 +8849,25 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> */
> #define MAX_PINNED_INTERVAL 512
>
> -static int need_active_balance(struct lb_env *env)
> +static inline bool
> +asym_active_balance(struct lb_env *env)
> {
> - struct sched_domain *sd = env->sd;
> + /*
> + * ASYM_PACKING needs to force migrate tasks from busy but
> + * lower priority CPUs in order to pack all tasks in the
> + * highest priority CPUs.
> + */
> + return env->idle != CPU_NOT_IDLE && (env->sd->flags & SD_ASYM_PACKING) &&
> + sched_asym_prefer(env->dst_cpu, env->src_cpu);
> +}
>
> - if (env->idle != CPU_NOT_IDLE) {
> +static inline bool
> +voluntary_active_balance(struct lb_env *env)
> +{
> + struct sched_domain *sd = env->sd;
>
> - /*
> - * ASYM_PACKING needs to force migrate tasks from busy but
> - * lower priority CPUs in order to pack all tasks in the
> - * highest priority CPUs.
> - */
> - if ((sd->flags & SD_ASYM_PACKING) &&
> - sched_asym_prefer(env->dst_cpu, env->src_cpu))
> - return 1;
> - }
> + if (asym_active_balance(env))
> + return 1;
>
> /*
> * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
> @@ -8881,6 +8885,16 @@ static int need_active_balance(struct lb_env *env)
> if (env->src_grp_type == group_misfit_task)
> return 1;
>
> + return 0;
> +}
> +

Yeah so that's the active balance classification I was afraid of, and I
don't agree with it.

The way I see things, we happen to have some mechanisms that let us know
straight away if we need an active balance (asym packing, misfit, lowered
capacity), and we rely on the sd->nr_balance_failed threshold for the
scenarios where we don't have any more information.

We do happen to have a threshold because we don't want to go overboard with
it, but when it is reached it's a clear sign that we *do* want to active
balance because that's all we can do to try and solve the imbalance.

To me, those are all legitimate reasons to. So they're all "voluntary"
really, we *do* want all of these.

> +static int need_active_balance(struct lb_env *env)
> +{
> + struct sched_domain *sd = env->sd;
> +
> + if (voluntary_active_balance(env))
> + return 1;
> +
> return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
> }
>
> @@ -9142,7 +9156,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> } else
> sd->nr_balance_failed = 0;
>
> - if (likely(!active_balance)) {
> + if (likely(!active_balance) || voluntary_active_balance(&env)) {

So now we reset the interval for all active balances (expect last active
balance case), even when it is done as a last resort because all other
tasks were pinned.

Arguably the current code isn't much better (always increase the interval
when active balancing), but at least it covers this case. It would be a
waste not to take this into account when we can detect this scenario
(I'll reiterate my LBF_ALL_PINNED suggestion).

> /* We were unbalanced, so reset the balancing interval */
> sd->balance_interval = sd->min_interval;
> } else {
>