Re: [PATCH 1/2] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on

From: Peter Zijlstra
Date: Tue Feb 13 2018 - 05:45:56 EST


On Mon, Feb 12, 2018 at 05:11:30PM +0000, Mel Gorman wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 50442697b455..0192448e43a2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5917,6 +5917,18 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  	if (!idlest)
>  		return NULL;
>
> +	/*
> +	 * When comparing groups across NUMA domains, it's possible for the
> +	 * local domain to be very lightly loaded relative to the remote
> +	 * domains but "imbalance" skews the comparison making remote CPUs
> +	 * look much more favourable. When considering cross-domain, add
> +	 * imbalance to the runnable load on the remote node and consider
> +	 * staying local.
> +	 */
> +	if ((sd->flags & SD_NUMA) &&
> +	    min_runnable_load + imbalance >= this_runnable_load)
> +		return NULL;
> +
>  	if (min_runnable_load > (this_runnable_load + imbalance))
>  		return NULL;

So this is basically a spread vs group decision, which we typically do
using SD_PREFER_SIBLING. Now that flag is a bit awkward in that it's set
on the child domain.
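
To illustrate what "set on the child domain" means in practice, here is a
minimal userspace sketch. It is not kernel code: struct toy_domain,
TOY_SD_PREFER_SIBLING (with an arbitrary value) and toy_prefer_sibling() are
made-up names, and the sketch only assumes that the balancer at a given level
looks at its child's flags to decide whether to spread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag value only; not the kernel's definition. */
#define TOY_SD_PREFER_SIBLING 0x1u

/* Hypothetical, stripped-down stand-in for a scheduler domain. */
struct toy_domain {
	unsigned int flags;
	struct toy_domain *child;	/* next lower level, NULL at the bottom */
};

/*
 * Should balancing at @sd prefer spreading over its groups?
 * The answer comes from the child's flags, not from @sd's own flags.
 */
static bool toy_prefer_sibling(const struct toy_domain *sd)
{
	return sd->child && (sd->child->flags & TOY_SD_PREFER_SIBLING);
}
```

With an LLC-level toy domain carrying the flag and a NUMA-level toy domain on
top of it, toy_prefer_sibling() is true for the NUMA level and false for the
LLC level itself, which is the awkward part.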

Now, we set it for SD_SHARE_PKG_RESOURCES (aka LLC), which means that for
our typical modern NUMA system we indicate we want to spread across the
lowest NUMA level. And regular load balancing will do so.

Now you modify the idlest code for initial placement to go against that
steady-state behaviour, which is unfortunate.

However, if we have NUMA balancing enabled, that will counteract
the normal spreading across nodes, so in that regard it makes sense, but
the above code is not conditional on NUMA balancing.
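
For concreteness, a rough userspace model of the two checks visible in the
hunk above. It is a sketch, not kernel code: stay_local() and its parameters
are hypothetical, it ignores everything else find_idlest_group() does, and the
numa_balancing argument models a condition the patch as posted does not have.

```c
#include <stdbool.h>

/*
 * Toy model of the two checks in the quoted hunk. Returns true where the
 * kernel function would return NULL, i.e. the task stays in the local group.
 */
static bool stay_local(bool sd_numa, bool numa_balancing,
		       unsigned long min_runnable_load,
		       unsigned long this_runnable_load,
		       unsigned long imbalance)
{
	/*
	 * Check added by the patch. The numa_balancing gate is an
	 * assumption made for illustration; the posted patch only
	 * tests SD_NUMA.
	 */
	if (sd_numa && numa_balancing &&
	    min_runnable_load + imbalance >= this_runnable_load)
		return true;

	/* Pre-existing check: remote is clearly busier than local. */
	if (min_runnable_load > this_runnable_load + imbalance)
		return true;

	return false;
}
```

With a local load of 100, a remote load of 50 and an imbalance of 60, the
NUMA check keeps the task local (50 + 60 >= 100), whereas with the gate off
only the second check runs and the task goes remote (50 is not > 160).
Passing numa_balancing = true unconditionally gives the behaviour of the
patch as posted.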

I'm torn and confused...