RE: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

From: Song Bao Hua (Barry Song)
Date: Tue Feb 09 2021 - 19:39:30 EST




> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@xxxxxxxxxxxxx]
> Sent: Wednesday, February 10, 2021 1:56 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> Cc: valentin.schneider@xxxxxxx; vincent.guittot@xxxxxxxxxx; mgorman@xxxxxxx;
> mingo@xxxxxxxxxx; dietmar.eggemann@xxxxxxx; morten.rasmussen@xxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; linuxarm@xxxxxxxxxxxxx; xuwei (O)
> <xuwei5@xxxxxxxxxx>; Liguozhu (Kenneth) <liguozhu@xxxxxxxxxxxxx>; tiantao (H)
> <tiantao6@xxxxxxxxxxxxx>; wanghuiqiang <wanghuiqiang@xxxxxxxxxx>; Zengtao (B)
> <prime.zeng@xxxxxxxxxxxxx>; Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>;
> guodong.xu@xxxxxxxxxx; Meelis Roos <mroos@xxxxxxxx>
> Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
>
> On Thu, Feb 04, 2021 at 12:12:01AM +1300, Barry Song wrote:
> > As long as NUMA diameter > 2, building sched_domain by sibling's child
> > domain will definitely create a sched_domain with sched_group which will
> > span out of the sched_domain:
> >
> > +------+ +------+ +-------+ +------+
> > | node | 12 |node | 20 | node | 12 |node |
> > | 0 +---------+1 +--------+ 2 +-------+3 |
> > +------+ +------+ +-------+ +------+
> >
> > domain0 node0 node1 node2 node3
> >
> > domain1 node0+1 node0+1 node2+3 node2+3
> > +
> > domain2 node0+1+2 |
> > group: node0+1 |
> > group:node2+3 <-------------------+
> >
> > when node2 is added into the domain2 of node0, kernel is using the child
> > domain of node2's domain2, which is domain1(node2+3). Node 3 is outside
> > the span of the domain including node0+1+2.
> >
> > This will make load_balance() run based on screwed avg_load and group_type
> > in the sched_group spanning out of the sched_domain, and it also makes
> > select_task_rq_fair() pick an idle CPU out of the sched_domain.
> >
> > Real servers which suffer from this problem include Kunpeng920 and 8-node
> > Sun Fire X4600-M2, at least.
> >
> > Here we move to use the *child* domain of the *child* domain of node2's
> > domain2 as the new added sched_group. At the same time, we re-use the
> > lower level sgc directly.
> >
> > +------+ +------+ +-------+ +------+
> > | node | 12 |node | 20 | node | 12 |node |
> > | 0 +---------+1 +--------+ 2 +-------+3 |
> > +------+ +------+ +-------+ +------+
> >
> > domain0 node0 node1 +- node2 node3
> > |
> > domain1 node0+1 node0+1 | node2+3 node2+3
> > |
> > domain2 node0+1+2 |
> > group: node0+1 |
> > group:node2 <-------------------+
> >
>
> I've finally had a moment to think about this, would it make sense to
> also break up group: node0+1, such that we then end up with 3 groups of
> equal size?

We used to build the sched_groups of sched_domain[n] of node[m] from:
1. local group: sched_domain[n-1] of node[m]
2. remote groups: sched_domain[n-1] of node[m]'s siblings
at the same level.
Since the sched_domain[n-1] of a subset of node[m]'s siblings can
already cover the whole span of sched_domain[n] of node[m], there is
no need to scan all siblings of node[m]: once sched_domain[n] of
node[m] has been covered, we stop creating sched_groups. So the
number of sched_groups stays small.
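
To make the steps above concrete, here is a rough userspace sketch of that
group-building loop. It is not the kernel code: nodes stand in for CPUs,
spans are plain bitmasks, and the distance table, span_of() and
build_groups_legacy() are hypothetical names and values picked only so that
node0's spans match the 4-node diagram quoted above.

```c
#define NR_NODES 4

/* Assumed distance table (illustrative, chosen to match the diagram). */
static const int dist[NR_NODES][NR_NODES] = {
	{10, 12, 20, 22},
	{12, 10, 22, 24},
	{20, 22, 10, 12},
	{22, 24, 12, 10},
};

/* Span of a node's domain: every node within max_dist, as a bitmask. */
static unsigned span_of(int node, int max_dist)
{
	unsigned mask = 0;

	for (int i = 0; i < NR_NODES; i++)
		if (dist[node][i] <= max_dist)
			mask |= 1u << i;
	return mask;
}

/*
 * Build the groups of node's domain (max_dist) from the child domain
 * (child_dist) of each sibling, starting with the local node and
 * skipping siblings that an earlier group already covers.
 * Returns the number of groups written to groups[].
 */
static int build_groups_legacy(int node, int max_dist, int child_dist,
			       unsigned groups[NR_NODES])
{
	unsigned span = span_of(node, max_dist);
	unsigned covered = 0;
	int n = 0;

	for (int k = 0; k < NR_NODES; k++) {
		int i = (node + k) % NR_NODES;	/* local node first */

		if (!(span & (1u << i)) || (covered & (1u << i)))
			continue;
		groups[n] = span_of(i, child_dist);
		covered |= groups[n];
		n++;
	}
	return n;
}
```

With this table, build_groups_legacy(0, 20, 12, groups) yields two groups,
node0+1 and node2+3, and the second one contains node3 although node0's
domain only spans node0+1+2.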

So historically, the code has never tried to make the sched_groups
equal in size, and it permits the local group and the remote groups
to overlap.

One issue with the original code is that once the topology reaches
3-hop NUMA, sched_domain[n-1] of node[m]'s siblings might span
outside sched_domain[n] of node[m]. My approach is to find a
descendant of the sibling's domain to build the remote groups, which
fixes the issue on the machines that have it and leaves machines
without the 3-hop issue untouched.
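
As a rough illustration of the descendant idea (again a userspace sketch,
not the patch itself): walk down from the sibling's child domain until its
span fits inside the domain being built. The distance table, level_dist[],
span_of(), descended_group() and build_groups_descended() are all
hypothetical names and values, chosen to match the 4-node diagram.

```c
#define NR_NODES  4
#define NR_LEVELS 5

/* Assumed distance table (illustrative, chosen to match the diagram). */
static const int dist[NR_NODES][NR_NODES] = {
	{10, 12, 20, 22},
	{12, 10, 22, 24},
	{20, 22, 10, 12},
	{22, 24, 12, 10},
};

/* Distance threshold of each domain level; level 0 is the node itself. */
static const int level_dist[NR_LEVELS] = {10, 12, 20, 22, 24};

/* Span of node's domain at the given level, as a bitmask of nodes. */
static unsigned span_of(int node, int level)
{
	unsigned mask = 0;

	for (int i = 0; i < NR_NODES; i++)
		if (dist[node][i] <= level_dist[level])
			mask |= 1u << i;
	return mask;
}

/*
 * Start at the sibling's child domain (level - 1) and descend until the
 * span no longer leaks outside 'span'. Level 0 always fits, since the
 * sibling itself is inside 'span'. Requires level >= 1.
 */
static unsigned descended_group(int sibling, int level, unsigned span)
{
	int l = level - 1;

	while (l > 0 && (span_of(sibling, l) & ~span))
		l--;
	return span_of(sibling, l);
}

/* Same loop as the legacy scheme, but using the descended span. */
static int build_groups_descended(int node, int level,
				  unsigned groups[NR_NODES])
{
	unsigned span = span_of(node, level);
	unsigned covered = 0;
	int n = 0;

	for (int k = 0; k < NR_NODES; k++) {
		int i = (node + k) % NR_NODES;	/* local node first */

		if (!(span & (1u << i)) || (covered & (1u << i)))
			continue;
		groups[n] = descended_group(i, level, span);
		covered |= groups[n];
		n++;
	}
	return n;
}
```

For node0 at level 2 this gives the groups node0+1 and node2, as in the
second diagram. For a sibling whose child domain already fits, the loop in
descended_group() never runs, so the result is the same as before.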

Valentin sent another RFC that breaks up all remote groups so that
each includes only the remote node, instead of using
sched_domain[n-1] of the siblings. This eliminates the problem from
the very beginning. One side effect is that it changes all machines,
including those without the 3-hop issue, by creating many more remote
sched_groups. So we both agreed to start with the descendant sibling
(grandchild) approach first.

What you are suggesting seems to be breaking up the local sched_group
as well, which would create many more local groups. That sounds like
a huge change, beyond the scope of the original issue we are trying
to fix :-)

>
> > w/ patch, we don't get "groups don't span domain->span" any more:
> > [ 1.486271] CPU0 attaching sched-domain(s):
> > [ 1.486820] domain-0: span=0-1 level=MC
> > [ 1.500924] groups: 0:{ span=0 cap=980 }, 1:{ span=1 cap=994 }
> > [ 1.515717] domain-1: span=0-3 level=NUMA
> > [ 1.515903] groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 }
> > [ 1.516989] domain-2: span=0-5 level=NUMA
> > [ 1.517124] groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 }
>
> groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3, cap=1989 },
> 4:{ span=4-5, cap=1949 }
>
> > [ 1.517369] domain-3: span=0-7 level=NUMA
> > [ 1.517423] groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 }
>
> Let me continue to think about this... it's been a while :/

Sure, thanks!

Barry