Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

From: Vincent Guittot
Date: Fri Aug 27 2021 - 06:13:57 EST


On Tue, 10 Aug 2021 at 16:41, Ricardo Neri
<ricardo.neri-calderon@xxxxxxxxxxxxxxx> wrote:
>
> When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> check for the idle state of the destination CPU, dst_cpu, but also of
> its SMT siblings.
>
> If dst_cpu is idle but its SMT siblings are busy, performance suffers
> if it pulls tasks from a medium priority CPU that does not have SMT
> siblings.
>
> Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
> siblings of both dst_cpu and the CPUs in the candidate busiest group.
>
> Cc: Aubrey Li <aubrey.li@xxxxxxxxx>
> Cc: Ben Segall <bsegall@xxxxxxxxxx>
> Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
> Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Quentin Perret <qperret@xxxxxxxxxx>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx>
> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> Cc: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> Reviewed-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> Reviewed-by: Len Brown <len.brown@xxxxxxxxx>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@xxxxxxxxxxxxxxx>
> ---
> Changes since v3:
> * Removed the arch_asym_check_smt_siblings() hook. Discussions with the
> powerpc folks showed that this patch should not impact them. Also, more
> recent powerpc processors no longer use asym_packing. (PeterZ)
> * Removed unnecessary local variable in asym_can_pull_tasks(). (Dietmar)
> * Removed unnecessary check for local CPUs when the local group has zero
> utilization. (Joel)
> * Renamed asym_can_pull_tasks() as asym_smt_can_pull_tasks() to reflect
> the fact that it deals with SMT cases.
> * Made asym_smt_can_pull_tasks() return false for !CONFIG_SCHED_SMT so
> that callers can deal with non-SMT cases.
>
> Changes since v2:
> * Reworded the commit message to reflect updates in code.
> * Corrected misrepresentation of dst_cpu as the CPU doing the load
> balancing. (PeterZ)
> * Removed call to arch_asym_check_smt_siblings() as it is now called in
> sched_asym().
>
> Changes since v1:
> * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
> tasks. Instead, reclassify the candidate busiest group, as it
> may still be selected. (PeterZ)
> * Avoid an expensive and unnecessary call to cpumask_weight() when
> determining if a sched_group is comprised of SMT siblings.
> (PeterZ).
> ---
> kernel/sched/fair.c | 95 +++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 95 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dd411cefb63f..8a1a2a43732c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8531,10 +8531,99 @@ group_type group_classify(unsigned int imbalance_pct,
> return group_has_spare;
> }
>
> +/**
> + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
> + * @dst_cpu: Destination CPU of the load balancing
> + * @sds: Load-balancing data with statistics of the local group
> + * @sgs: Load-balancing statistics of the candidate busiest group
> + * @sg: The candidate busiest group
> + *
> + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
> + * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
> + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> + *
> + * If both @dst_cpu and @sg have SMT siblings, even the number of idle CPUs
> + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference
> + * between the number of busy CPUs is 2 or more. If the difference is of 1,
> + * only pull if @dst_cpu has higher priority. If @sg does not have SMT siblings
> + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg
> + * has lower priority.
> + */
> +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> + struct sg_lb_stats *sgs,
> + struct sched_group *sg)
> +{
> +#ifdef CONFIG_SCHED_SMT
> + bool local_is_smt, sg_is_smt;
> + int sg_busy_cpus;
> +
> + local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> + sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> +
> + sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> + if (!local_is_smt) {
> + /*
> + * If we are here, @dst_cpu is idle and does not have SMT
> + * siblings. Pull tasks if candidate group has two or more
> + * busy CPUs.
> + */
> + if (sg_is_smt && sg_busy_cpus >= 2)
> + return true;
> +
> + /*
> + * @dst_cpu does not have SMT siblings. @sg may have SMT
> + * siblings and only one is busy. In such case, @dst_cpu
> + * can help if it has higher priority and is idle.
> + */
> + return !sds->local_stat.group_util &&

sds->local_stat.group_util can't be used to decide whether a CPU or group
of CPUs is idle. util_avg is usually not zero when a CPU becomes idle,
and you may have to wait more than 300ms before it decays to zero.
Conversely, the utilization of a CPU can be zero while a task with
zero utilization has just woken up on it.
Utilization reflects the average work of the CPU or group of
CPUs, not its current state.
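
To put a rough number on that delay, here is a minimal sketch, assuming only
that a PELT contribution halves every 32 ms. The helper name and the
halve-once-per-period model are illustrative simplifications, not kernel
code; the real signal decays every millisecond, so the exact figure differs
somewhat.

```c
#define PELT_HALF_LIFE_MS 32	/* a PELT contribution halves every 32 ms */

/*
 * Hypothetical helper, simplified model: halve @util once per half-life
 * instead of decaying it every millisecond. Returns how many ms of
 * idleness pass before the integer value reaches 0.
 */
static unsigned int idle_ms_until_util_zero(unsigned int util)
{
	unsigned int ms = 0;

	while (util) {
		util >>= 1;
		ms += PELT_HALF_LIFE_MS;
	}
	return ms;
}
```

Under this model a fully utilized CPU (util of 1024) that goes idle keeps a
non-zero utilization for several hundred milliseconds, so a group_util == 0
test would keep treating it as busy long after it has nothing to run.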

> + sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> + }
> +
> + /* @dst_cpu has SMT siblings. */
> +
> + if (sg_is_smt) {
> + int local_busy_cpus = sds->local->group_weight -
> + sds->local_stat.idle_cpus;
> + int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> +
> + /* Local can always help to even the number of busy CPUs. */

The default behavior of the load balancer already tries to even out the
number of idle CPUs.

> + if (busy_cpus_delta >= 2)
> + return true;
> +
> + if (busy_cpus_delta == 1)
> + return sched_asym_prefer(dst_cpu,
> + sg->asym_prefer_cpu);
> +
> + return false;
> + }
> +
> + /*
> + * @sg does not have SMT siblings. Ensure that @sds::local does not end
> + * up with more than one busy SMT sibling and only pull tasks if there
> + * are not busy CPUs. As CPUs move in and out of idle state frequently,
> + * also check the group utilization to smoother the decision.
> + */
> + if (!sds->local_stat.group_util)

Same comment as above about the meaning of group_util == 0.
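
The difference between an averaged signal and the current state can be
sketched as follows. struct group_stats is a hypothetical, trimmed-down
stand-in for struct sg_lb_stats, and both helpers are illustrative names
rather than existing kernel functions. Checking the running and idle
counters is one possible alternative, not something proposed in this thread.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for the kernel's struct sg_lb_stats. */
struct group_stats {
	unsigned long group_util;	/* averaged utilization (history) */
	unsigned int sum_nr_running;	/* tasks runnable right now */
	unsigned int idle_cpus;		/* CPUs currently idle */
	unsigned int group_weight;	/* number of CPUs in the group */
};

/* History-based test: true only once the average has fully decayed. */
static bool group_looks_idle_by_util(const struct group_stats *s)
{
	return s->group_util == 0;
}

/* State-based test: nothing runnable and every CPU reported idle. */
static bool group_is_idle_now(const struct group_stats *s)
{
	return s->sum_nr_running == 0 && s->idle_cpus == s->group_weight;
}
```

A group that has just gone idle still carries utilization, so the first
helper reports it busy while the second reports it idle. A group where a
zero-utilization task has just woken up gives the opposite pair of answers.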

> + return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> +
> + return false;
> +#else
> + /* Always return false so that callers deal with non-SMT cases. */
> + return false;
> +#endif
> +}
> +
> static inline bool
> sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
> struct sched_group *group)
> {
> + /* Only do SMT checks if either local or candidate have SMT siblings */
> + if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
> + (group->flags & SD_SHARE_CPUCAPACITY))
> + return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> +
> return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> }
>
> @@ -9540,6 +9629,12 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> nr_running == 1)
> continue;
>
> + /* Make sure we only pull tasks from a CPU of lower priority */
> + if ((env->sd->flags & SD_ASYM_PACKING) &&
> + sched_asym_prefer(i, env->dst_cpu) &&
> + nr_running == 1)
> + continue;

This looks really similar to the test above for SD_ASYM_CPUCAPACITY.
More generally, SD_ASYM_PACKING and SD_ASYM_CPUCAPACITY share
a lot of common policy, and I wonder whether at some point we could
merge their behavior in LB.

> +
> switch (env->migration_type) {
> case migrate_load:
> /*
> --
> 2.17.1
>