Re: [PATCH] sched/fair: Force balancing on nohz balance if local group has capacity

From: Brendan Jackman
Date: Wed Sep 20 2017 - 13:21:25 EST


Hi Peter, Josef,

Do you have any thoughts on this one?

On Mon, Aug 07 2017 at 16:39, Brendan Jackman wrote:
> The "goto force_balance" here is intended to mitigate the fact that
> avg_load calculations can result in bad placement decisions when
> priority is asymmetrical. From the original commit (fab476228ba3
> "sched: Force balancing on newidle balance if local group has
> capacity") that adds it:
>
> Under certain situations, such as a niced down task (i.e. nice =
> -15) in the presence of nr_cpus NICE0 tasks, the niced task lands
> on a sched group and kicks away other tasks because of its large
> weight. This leads to sub-optimal utilization of the
> machine. Even though the sched group has capacity, it does not
> pull tasks because sds.this_load >> sds.max_load, and f_b_g()
> returns NULL.
>
> A similar but inverted issue also affects ARM
> big.LITTLE (asymmetrical CPU capacity) systems - consider 8
> always-running, same-priority tasks on a system with 4 "big" and 4
> "little" CPUs. Suppose that 5 of them end up on the "big" CPUs (which
> will be represented by one sched_group in the DIE sched_domain) and 3
> on the "little" (the other sched_group in DIE), leaving one CPU
> unused. Because the "big" group has a higher group_capacity, its
> avg_load may not present an imbalance that would cause migrating a
> task to the idle "little" CPU.
>
> The force_balance case here solves the problem, but currently only
> for CPU_NEWLY_IDLE balances, which in theory might never happen on
> the unused CPU. Including CPU_IDLE in the force_balance case means
> there's an upper bound on the time before we can attempt to solve
> the underutilization: once DIE's sd->balance_interval has passed,
> the next nohz balance kick will help us out.
>
> Signed-off-by: Brendan Jackman <brendan.jackman@xxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Mike Galbraith <efault@xxxxxx>
> Cc: Paul Turner <pjt@xxxxxxxxxx>
> ---
> kernel/sched/fair.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c95880e216f6..63eff3e881a0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7801,8 +7801,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> if (busiest->group_type == group_imbalanced)
> goto force_balance;
>
> - /* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
> - if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
> + /*
> + * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
> + * capacities from resulting in underutilization due to avg_load.
> + */
> + if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> busiest->group_no_capacity)
> goto force_balance;
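
To illustrate the big.LITTLE case from the commit message, here is a rough
sketch (not kernel code) of the avg_load comparison that leaves the little
CPU idle; the capacity figures (1024 per big CPU, 512 per little) and the
simplified load model are illustrative assumptions, not values from fair.c:

```python
# Simplified model of the per-group avg_load calculation: group load
# scaled by the group's total compute capacity.
SCHED_CAPACITY_SCALE = 1024

def avg_load(group_load, group_capacity):
    return group_load * SCHED_CAPACITY_SCALE // group_capacity

# 4 "big" CPUs (capacity 1024 each) running 5 always-running tasks:
big = avg_load(5 * 1024, 4 * 1024)      # -> 1280
# 4 "little" CPUs (capacity 512 each) running 3 tasks, one CPU idle:
little = avg_load(3 * 1024, 4 * 512)    # -> 1536

# The idle little CPU's local group reports the *higher* avg_load, so
# without the force_balance escape there is no apparent imbalance and
# no task is pulled from the overloaded big group.
assert little > big
```

With the patch, the idle little CPU's periodic/nohz balance hits the
force_balance path instead of bailing out on this avg_load comparison.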