[patch] sched: Fix smp nice induced group scheduling load distribution woes

From: Mike Galbraith
Date: Wed Apr 27 2016 - 03:09:59 EST


On Mon, 2016-04-25 at 11:18 +0200, Mike Galbraith wrote:
> On Sun, 2016-04-24 at 09:05 +0200, Mike Galbraith wrote:
> > On Sat, 2016-04-23 at 18:38 -0700, Brendan Gregg wrote:
> >
> > > The bugs they found seem real, and their analysis is great
> > > (although
> > > using visualizations to find and fix scheduler bugs isn't new),
> > > and it
> > > would be good to see these fixed. However, it would also be
> > > useful to
> > > double check how widespread these issues really are. I suspect
> > > many on
> > > this list can test these patches in different environments.
> >
> > Part of it sounded to me very much like they're meeting and
> > "fixing"
> > SMP group fairness...
>
> Ew, NUMA boxen look like they could use a hug or two. Add a group of
> one hog to compete with a box wide kbuild, ~lose a node.

sched: Fix smp nice induced group scheduling load distribution woes

On even a modest-sized NUMA box, any load that wants to scale
is essentially reduced to the SCHED_IDLE class by smp nice scaling.
Limit niceness to prevent cramming a box-wide load into a too-small
space. Since niceness also affects latency, give the user the
option to completely disable box-wide group fairness as well.

time make -j192 modules on a 4-node NUMA box..

Before:
root cgroup
real 1m6.987s 1.00

cgroup vs 1 group of 1 hog
real 1m20.871s 1.20

cgroup vs 2 groups of 1 hog
real 1m48.803s 1.62

Each single-task group receives a ~full socket because the kbuild
has become an essentially massless object that fits in practically
no space at all. Near-perfect math led directly to far-from-good
scaling/performance, a "Perfect is the enemy of good" poster child.
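
To put rough numbers on that, here is a small user-space sketch of the
shares math from the first fair.c hunk below. The inputs are illustrative
assumptions (a 192-CPU box running ~192 nice-0 kbuild tasks in one group,
default group shares of 1024, MIN_SHARES of 2, load resolution scaling
ignored), not measurements from the box above. The 9548 and 1024 constants
are the kernel's nice -10 and nice 0 weights, i.e. the
sched_prio_to_weight[10] and [20] entries the patch indexes; the
min_shares bump isn't modeled since default shares never exceed 1024.

#include <stdio.h>

#define MIN_SHARES	2	/* kernel's floor for a group entity's weight */
#define NICE_0_WEIGHT	1024	/* sched_prio_to_weight[20] */
#define NICE_M10_WEIGHT	9548	/* sched_prio_to_weight[10] */

/* Same math as calc_cfs_shares() in the hunk below, with the tg_weight
 * clamp made a parameter so one helper covers both cases (hypothetical
 * names, not kernel code). */
static long group_se_weight(long tg_shares, long tg_weight, long cfs_rq_load,
			    long tg_weight_cap)
{
	long shares;

	if (tg_weight > tg_weight_cap)
		tg_weight = tg_weight_cap;

	shares = tg_shares * cfs_rq_load;
	if (tg_weight)
		shares /= tg_weight;

	if (shares < MIN_SHARES)
		shares = MIN_SHARES;
	if (shares > tg_shares)
		shares = tg_shares;
	return shares;
}

int main(void)
{
	long tg_shares = NICE_0_WEIGHT;		/* default group shares */
	long tg_weight = 192 * NICE_0_WEIGHT;	/* ~192 nice-0 kbuild tasks box-wide */
	long load = NICE_0_WEIGHT;		/* one of them on this CPU's cfs_rq */

	printf("unclamped: %ld\n",	/* ~5 vs the hog group's 1024 */
	       group_se_weight(tg_shares, tg_weight, load, tg_weight));
	printf("clamped:   %ld\n",	/* ~109, about nice +10 */
	       group_se_weight(tg_shares, tg_weight, load, NICE_M10_WEIGHT));
	return 0;
}

Unclamped, the kbuild's per-CPU group entity weighs ~5 against the hog
group's 1024, which is SCHED_IDLE territory; with tg_weight capped at the
nice -10 weight it comes back to ~109, about what a nice +10 task would
weigh.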

After "Let's just be nice enough instead" adjustment, single task
groups continued to sustain >99% utilization while competing with
the box sized kbuild.

cgroup vs 2 groups of 1 hog
real 1m8.151s 1.01 192/190=1.01

Good enough works better.. nearly perfectly in this case.

Signed-off-by: Mike Galbraith <umgwanakikbuit@xxxxxxxxx>
---
 kernel/sched/fair.c     | 22 ++++++++++++++++++----
 kernel/sched/features.h |  3 +++
 2 files changed, 21 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -2464,17 +2464,28 @@ static inline long calc_tg_weight(struct

 static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 {
-	long tg_weight, load, shares;
+	long tg_weight, load, shares, min_shares = MIN_SHARES;
 
-	tg_weight = calc_tg_weight(tg, cfs_rq);
+	if (!sched_feat(SMP_NICE_GROUPS))
+		return tg->shares;
+
+	/*
+	 * Bound niceness to prevent everything that wants to scale from
+	 * essentially becoming SCHED_IDLE on multi/large socket boxen,
+	 * screwing up our ability to distribute load properly and/or
+	 * deliver acceptable latencies.
+	 */
+	tg_weight = min_t(long, calc_tg_weight(tg, cfs_rq), sched_prio_to_weight[10]);
 	load = cfs_rq->load.weight;
 
 	shares = (tg->shares * load);
 	if (tg_weight)
 		shares /= tg_weight;
 
-	if (shares < MIN_SHARES)
-		shares = MIN_SHARES;
+	if (tg->shares > sched_prio_to_weight[20])
+		min_shares = sched_prio_to_weight[20];
+	if (shares < min_shares)
+		shares = min_shares;
 	if (shares > tg->shares)
 		shares = tg->shares;

@@ -2517,6 +2528,9 @@ static void update_cfs_shares(struct cfs
 #ifndef CONFIG_SMP
 	if (likely(se->load.weight == tg->shares))
 		return;
+#else
+	if (!sched_feat(SMP_NICE_GROUPS) && se->load.weight == tg->shares)
+		return;
 #endif
 	shares = calc_cfs_shares(cfs_rq, tg);

Index: linux-2.6/kernel/sched/features.h
===================================================================
--- linux-2.6.orig/kernel/sched/features.h
+++ linux-2.6/kernel/sched/features.h
@@ -69,3 +69,6 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)

+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+SCHED_FEAT(SMP_NICE_GROUPS, true)
+#endif
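
As for the changelog's option to switch box-wide group fairness off
entirely: with CONFIG_SCHED_DEBUG, SCHED_FEAT() bits appear in
/sys/kernel/debug/sched_features and are toggled by writing the feature
name, or NO_<name> to clear it. A minimal sketch of that write, assuming
a kernel carrying this patch (CONFIG_FAIR_GROUP_SCHED && CONFIG_SMP, so
the bit exists):

#include <stdio.h>
#include <string.h>

/* Toggle the SMP_NICE_GROUPS scheduler feature via debugfs: pass "off"
 * to write NO_SMP_NICE_GROUPS, anything else (or nothing) re-enables it. */
int main(int argc, char **argv)
{
	const char *name = (argc > 1 && !strcmp(argv[1], "off")) ?
				"NO_SMP_NICE_GROUPS" : "SMP_NICE_GROUPS";
	FILE *f = fopen("/sys/kernel/debug/sched_features", "w");

	if (!f) {
		perror("sched_features");
		return 1;
	}
	fprintf(f, "%s\n", name);
	return fclose(f) ? 1 : 0;
}

Echoing the same strings into the file from a root shell does the same
thing.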