Re: [PATCH v2 4/8] sched/fair: rework load_balance

From: Vincent Guittot
Date: Mon Aug 26 2019 - 05:31:26 EST

Next message: Peter Zijlstra: "Re: [PATCH] sched/cpufreq: Align trace event behavior of fast switching"
Previous message: Ingo Molnar: "Re: [PATCH 4/5] x86/intel: Aggregate microserver naming"
In reply to: Valentin Schneider: "Re: [PATCH v2 4/8] sched/fair: rework load_balance"
Next in thread: Vincent Guittot: "Re: [PATCH v2 4/8] sched/fair: rework load_balance"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, 6 Aug 2019 at 17:56, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Thu, Aug 01, 2019 at 04:40:20PM +0200, Vincent Guittot wrote:
> > The load_balance algorithm contains some heuristics which have becomes
> > meaningless since the rework of metrics and the introduction of PELT.
> >
> > Furthermore, it's sometimes difficult to fix wrong scheduling decisions
> > because everything is based on load whereas some imbalances are not
> > related to the load. The current algorithm ends up to create virtual and
> > meaningless value like the avg_load_per_task or tweaks the state of a
> > group to make it overloaded whereas it's not, in order to try to migrate
> > tasks.
> >
> > load_balance should better qualify the imbalance of the group and define
> > cleary what has to be moved to fix this imbalance.
> >
> > The type of sched_group has been extended to better reflect the type of
> > imbalance. We now have :
> > group_has_spare
> > group_fully_busy
> > group_misfit_task
> > group_asym_capacity
> > group_imbalanced
> > group_overloaded
> >
> > Based on the type of sched_group, load_balance now sets what it wants to
> > move in order to fix the imnbalance. It can be some load as before but
> > also some utilization, a number of task or a type of task:
> > migrate_task
> > migrate_util
> > migrate_load
> > migrate_misfit
> >
> > This new load_balance algorithm fixes several pending wrong tasks
> > placement:
> > - the 1 task per CPU case with asymetrics system
> > - the case of cfs task preempted by other class
> > - the case of tasks not evenly spread on groups with spare capacity
> >
> > The load balance decisions have been gathered in 3 functions:
> > - update_sd_pick_busiest() select the busiest sched_group.
> > - find_busiest_group() checks if there is an imabalance between local and
> > busiest group.
> > - calculate_imbalance() decides what have to be moved.
>
> I like this lots, very good stuff.
>
> It is weird how misfit is still load, but istr later patches cure that.
>
> Here's a whole pile of nits, some overlap with what Valentin already
> noted. His suggestions for the changelog also make sense to me.

ok.
Thanks

>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8205,10 +8205,10 @@ static bool update_sd_pick_busiest(struc
> {
> struct sg_lb_stats *busiest = &sds->busiest_stat;
>
> -
> /* Make sure that there is at least one task to pull */
> if (!sgs->sum_h_nr_running)
> return false;
> +
> /*
> * Don't try to pull misfit tasks we can't help.
> * We can use max_capacity here as reduction in capacity on some
> @@ -8239,11 +8239,11 @@ static bool update_sd_pick_busiest(struc
> break;
>
> case group_imbalanced:
> - /* Select the 1st imbalanced group as we don't have
> - * any way to choose one more than another
> + /*
> + * Select the 1st imbalanced group as we don't have any way to
> + * choose one more than another
> */
> return false;
> - break;
>
> case group_asym_capacity:
> /* Prefer to move from lowest priority CPU's work */
> @@ -8253,8 +8253,8 @@ static bool update_sd_pick_busiest(struc
>
> case group_misfit_task:
> /*
> - * If we have more than one misfit sg go with the
> - * biggest misfit.
> + * If we have more than one misfit sg go with the biggest
> + * misfit.
> */
> if (sgs->group_misfit_task_load < busiest->group_misfit_task_load)
> return false;
> @@ -8262,14 +8262,14 @@ static bool update_sd_pick_busiest(struc
>
> case group_fully_busy:
> /*
> - * Select the fully busy group with highest avg_load.
> - * In theory, there is no need to pull task from such
> - * kind of group because tasks have all compute
> - * capacity that they need but we can still improve the
> - * overall throughput by reducing contention when
> - * accessing shared HW resources.
> - * XXX for now avg_load is not computed and always 0 so
> - * we select the 1st one.
> + * Select the fully busy group with highest avg_load. In
> + * theory, there is no need to pull task from such kind of
> + * group because tasks have all compute capacity that they need
> + * but we can still improve the overall throughput by reducing
> + * contention when accessing shared HW resources.
> + *
> + * XXX for now avg_load is not computed and always 0 so we
> + * select the 1st one.
> */
> if (sgs->avg_load <= busiest->avg_load)
> return false;
> @@ -8277,11 +8277,11 @@ static bool update_sd_pick_busiest(struc
>
> case group_has_spare:
> /*
> - * Select not overloaded group with lowest number of
> - * idle cpus. We could also compare the spare capacity
> - * which is more stable but it can end up that the
> - * group has less spare capacity but finally more idle
> - * cpus which means less opportunity to pull tasks.
> + * Select not overloaded group with lowest number of idle cpus.
> + * We could also compare the spare capacity which is more
> + * stable but it can end up that the group has less spare
> + * capacity but finally more idle cpus which means less
> + * opportunity to pull tasks.
> */
> if (sgs->idle_cpus >= busiest->idle_cpus)
> return false;
> @@ -8289,10 +8289,10 @@ static bool update_sd_pick_busiest(struc
> }
>
> /*
> - * Candidate sg has no more than one task per CPU and
> - * has higher per-CPU capacity. Migrating tasks to less
> - * capable CPUs may harm throughput. Maximize throughput,
> - * power/energy consequences are not considered.
> + * Candidate sg has no more than one task per CPU and has higher
> + * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> + * throughput. Maximize throughput, power/energy consequences are not
> + * considered.
> */
> if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> (sgs->group_type <= group_fully_busy) &&
> @@ -8428,6 +8428,7 @@ static inline void calculate_imbalance(s
>
> local = &sds->local_stat;
> busiest = &sds->busiest_stat;
> +
> if (busiest->group_type == group_imbalanced) {
> /*
> * In the group_imb case we cannot rely on group-wide averages
> @@ -8442,8 +8443,8 @@ static inline void calculate_imbalance(s
>
> if (busiest->group_type == group_asym_capacity) {
> /*
> - * In case of asym capacity, we will try to migrate all load
> - * to the preferred CPU
> + * In case of asym capacity, we will try to migrate all load to
> + * the preferred CPU
> */
> env->balance_type = migrate_load;
> env->imbalance = busiest->group_load;
> @@ -8483,7 +8484,8 @@ static inline void calculate_imbalance(s
> * groups.
> */
> env->balance_type = migrate_task;
> - env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> + env->imbalance = (busiest->sum_h_nr_running -
> + local->sum_h_nr_running) >> 1;
> return;
> }
>
> @@ -8502,7 +8504,7 @@ static inline void calculate_imbalance(s
> */
> if (local->group_type < group_overloaded) {
> /*
> - * Local will become overvloaded so the avg_load metrics are
> + * Local will become overloaded so the avg_load metrics are
> * finally needed
> */
>
> @@ -8514,12 +8516,12 @@ static inline void calculate_imbalance(s
> }
>
> /*
> - * Both group are or will become overloaded and we're trying to get
> - * all the CPUs to the average_load, so we don't want to push
> - * ourselves above the average load, nor do we wish to reduce the
> - * max loaded CPU below the average load. At the same time, we also
> - * don't want to reduce the group load below the group capacity.
> - * Thus we look for the minimum possible imbalance.
> + * Both group are or will become overloaded and we're trying to get all
> + * the CPUs to the average_load, so we don't want to push ourselves
> + * above the average load, nor do we wish to reduce the max loaded CPU
> + * below the average load. At the same time, we also don't want to
> + * reduce the group load below the group capacity. Thus we look for
> + * the minimum possible imbalance.
> */
> env->balance_type = migrate_load;
> env->imbalance = min(
> @@ -8641,7 +8643,6 @@ static struct sched_group *find_busiest_
> if (100 * busiest->avg_load <=
> env->sd->imbalance_pct * local->avg_load)
> goto out_balanced;
> -
> }
>
> /* Try to move all excess tasks to child's sibling domain */
> @@ -8651,7 +8652,7 @@ static struct sched_group *find_busiest_
>
> if (busiest->group_type != group_overloaded &&
> (env->idle == CPU_NOT_IDLE ||
> - local->idle_cpus <= (busiest->idle_cpus + 1)))
> + local->idle_cpus <= (busiest->idle_cpus + 1))) {
> /*
> * If the busiest group is not overloaded
> * and there is no imbalance between this and busiest group
> @@ -8661,15 +8662,14 @@ static struct sched_group *find_busiest_
> * group.
> */
> goto out_balanced;
> + }
>
> force_balance:
> /* Looks like there is an imbalance. Compute it */
> calculate_imbalance(env, &sds);
> -
> return env->imbalance ? sds.busiest : NULL;
>
> out_balanced:
> -
> env->imbalance = 0;
> return NULL;
> }

Next message: Peter Zijlstra: "Re: [PATCH] sched/cpufreq: Align trace event behavior of fast switching"
Previous message: Ingo Molnar: "Re: [PATCH 4/5] x86/intel: Aggregate microserver naming"
In reply to: Valentin Schneider: "Re: [PATCH v2 4/8] sched/fair: rework load_balance"
Next in thread: Vincent Guittot: "Re: [PATCH v2 4/8] sched/fair: rework load_balance"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]