Re: [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains

From: Vincent Guittot
Date: Fri Jul 06 2018 - 06:18:35 EST


On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <morten.rasmussen@xxxxxxx> wrote:
>
> The 'prefer sibling' sched_domain flag is intended to encourage
> spreading tasks to sibling sched_domains to take advantage of more
> caches and cores on SMT systems. It has recently been changed to be
> set on all non-NUMA topology levels. However, spreading across domains
> with cpu capacity asymmetry isn't desirable, e.g. spreading from high
> capacity to low capacity cpus, even when the high capacity cpus aren't
> overutilized, might give access to more cache but the cpus will be
> slower and overall throughput possibly worse.
>
> To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
> level immediately below SD_ASYM_CPUCAPACITY.

This makes sense. Nevertheless, this patch also raises a scheduling
problem and breaks the one-task-per-CPU policy that SD_PREFER_SIBLING
enforces. When running the tests from your cover letter, a long
running task is often co-scheduled on a big core while short pinned
tasks are still running and a little core stays idle, which is not an
optimal scheduling decision.
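
For context, that one-task-per-CPU spreading comes from the parent
level load balance honouring the child's flag. A simplified sketch of
the SD_PREFER_SIBLING handling in update_sd_lb_stats() in
kernel/sched/fair.c (condition trimmed for illustration, not the exact
upstream check):

	struct sched_domain *child = env->sd->child;
	struct sg_lb_stats *local = &sds->local_stat;
	int prefer_sibling = child && (child->flags & SD_PREFER_SIBLING);

	/* ... inside the per-group loop ... */

	/*
	 * If the child level prefers siblings, pretend this group has
	 * no spare capacity so the balancer tries to move its excess
	 * tasks over to the local group - this is what ends up placing
	 * one task per cpu. With SD_PREFER_SIBLING cleared on the
	 * child, as this patch does below SD_ASYM_CPUCAPACITY, this
	 * path is no longer taken.
	 */
	if (prefer_sibling && sds->local && group_has_capacity(env, local)) {
		sgs->group_no_capacity = 1;
		sgs->group_type = group_classify(sg, sgs);
	}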

>
> cc: Ingo Molnar <mingo@xxxxxxxxxx>
> cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> ---
> kernel/sched/topology.c | 12 ++++++++----
> 1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 29c186961345..00c7a08c7f77 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1140,7 +1140,7 @@ sd_init(struct sched_domain_topology_level *tl,
> | 0*SD_SHARE_CPUCAPACITY
> | 0*SD_SHARE_PKG_RESOURCES
> | 0*SD_SERIALIZE
> - | 0*SD_PREFER_SIBLING
> + | 1*SD_PREFER_SIBLING
> | 0*SD_NUMA
> | sd_flags
> ,
> @@ -1186,17 +1186,21 @@ sd_init(struct sched_domain_topology_level *tl,
> if (sd->flags & SD_ASYM_CPUCAPACITY) {
> struct sched_domain *t = sd;
>
> + /*
> + * Don't attempt to spread across cpus of different capacities.
> + */
> + if (sd->child)
> + sd->child->flags &= ~SD_PREFER_SIBLING;
> +
> for_each_lower_domain(t)
> t->flags |= SD_BALANCE_WAKE;
> }
>
> if (sd->flags & SD_SHARE_CPUCAPACITY) {
> - sd->flags |= SD_PREFER_SIBLING;
> sd->imbalance_pct = 110;
> sd->smt_gain = 1178; /* ~15% */
>
> } else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
> - sd->flags |= SD_PREFER_SIBLING;
> sd->imbalance_pct = 117;
> sd->cache_nice_tries = 1;
> sd->busy_idx = 2;
> @@ -1207,6 +1211,7 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->busy_idx = 3;
> sd->idle_idx = 2;
>
> + sd->flags &= ~SD_PREFER_SIBLING;
> sd->flags |= SD_SERIALIZE;
> if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
> sd->flags &= ~(SD_BALANCE_EXEC |
> @@ -1216,7 +1221,6 @@ sd_init(struct sched_domain_topology_level *tl,
>
> #endif
> } else {
> - sd->flags |= SD_PREFER_SIBLING;
> sd->cache_nice_tries = 1;
> sd->busy_idx = 2;
> sd->idle_idx = 1;
> --
> 2.7.4
>
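
If it helps to confirm where the flag ends up on an asymmetric system,
something along these lines could be used (untested sketch; assumes
CONFIG_SCHED_DEBUG so sd->name is populated):

	/*
	 * Untested: walk a cpu's sched_domain hierarchy and report
	 * which levels still carry SD_PREFER_SIBLING after
	 * build_sched_domains() has run.
	 */
	static void dump_prefer_sibling(int cpu)
	{
		struct sched_domain *sd;

		rcu_read_lock();
		for_each_domain(cpu, sd)
			pr_info("cpu%d %s: prefer_sibling=%d\n",
				cpu, sd->name,
				!!(sd->flags & SD_PREFER_SIBLING));
		rcu_read_unlock();
	}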