RE: [PATCH v2] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

From: Song Bao Hua (Barry Song)
Date: Wed Feb 10 2021 - 07:30:22 EST


> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@xxxxxxxxxxxxx]
> Sent: Thursday, February 11, 2021 12:22 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> Cc: valentin.schneider@xxxxxxx; vincent.guittot@xxxxxxxxxx; mgorman@xxxxxxx;
> mingo@xxxxxxxxxx; dietmar.eggemann@xxxxxxx; morten.rasmussen@xxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; linuxarm@xxxxxxxxxxxxx; xuwei (O)
> <xuwei5@xxxxxxxxxx>; Liguozhu (Kenneth) <liguozhu@xxxxxxxxxxxxx>; tiantao (H)
> <tiantao6@xxxxxxxxxxxxx>; wanghuiqiang <wanghuiqiang@xxxxxxxxxx>; Zengtao (B)
> <prime.zeng@xxxxxxxxxxxxx>; Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>;
> guodong.xu@xxxxxxxxxx; Meelis Roos <mroos@xxxxxxxx>
> Subject: Re: [PATCH v2] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
>
> On Tue, Feb 09, 2021 at 08:58:15PM +0000, Song Bao Hua (Barry Song) wrote:
>
> > > I've finally had a moment to think about this, would it make sense to
> > > also break up group: node0+1, such that we then end up with 3 groups of
> > > equal size?
> >
>
> > Since the sched_domain[n-1] of some of node[m]'s siblings can cover
> > the whole span of node[m]'s sched_domain[n], there is no need to scan
> > all of node[m]'s siblings; once node[m]'s sched_domain[n] has been
> > covered, we can stop making more sched_groups. So the number of
> > sched_groups is small.
> >
> > So historically, the code has never tried to make sched_groups result
> > in equal size. And it permits the overlapping of local group and remote
> > groups.
>
> Historically groups have (typically) always been the same size though.

This is probably true for other platforms, but unfortunately it has never
been true on my platform :-)

node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10

Here each NUMA node has only two CPUs.

CPU0's domain-3 has no sched_group overflowing the domain's span, but its
first group covers CPUs 0-5 (node0-node2) while the second group covers
CPUs 4-7 (node2-node3), so the two groups overlap and differ in size:

[ 0.802139] CPU0 attaching sched-domain(s):
[ 0.802193] domain-0: span=0-1 level=MC
[ 0.802443] groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 }
[ 0.802693] domain-1: span=0-3 level=NUMA
[ 0.802731] groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 }
[ 0.802811] domain-2: span=0-5 level=NUMA
[ 0.802829] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 }
[ 0.802881] ERROR: groups don't span domain->span
[ 0.803058] domain-3: span=0-7 level=NUMA
[ 0.803080] groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 }
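
In case it helps, below is a quick user-space sketch (not kernel code, only
a rough approximation of what build_overlap_sched_groups() ends up doing for
this topology; the 2-CPUs-per-node layout and all names are my own). It
derives CPU0's NUMA domain spans from the distance table above and builds
each level's groups from the child-level span of the first not-yet-covered
node, stopping once the domain span is covered. It reproduces the group
layout in the dmesg above, including the 4-7 group spilling out of
domain-2's span and the 6-CPU vs 4-CPU groups of domain-3.

#include <stdio.h>

#define NR_NODES	4
/* node n owns CPUs 2n and 2n+1 */
#define NODE_CPUS(n)	(0x3u << (2 * (n)))

static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 12, 20, 22 },
	{ 12, 10, 22, 24 },
	{ 20, 22, 10, 12 },
	{ 22, 24, 12, 10 },
};

/* CPU mask covering every node within @max_dist of @node */
static unsigned int numa_span(int node, int max_dist)
{
	unsigned int mask = 0;
	int i;

	for (i = 0; i < NR_NODES; i++)
		if (dist[node][i] <= max_dist)
			mask |= NODE_CPUS(i);
	return mask;
}

static void print_mask(unsigned int mask)
{
	int cpu;

	printf("{");
	for (cpu = 0; cpu < 2 * NR_NODES; cpu++)
		if (mask & (1u << cpu))
			printf(" %d", cpu);
	printf(" }");
}

int main(void)
{
	/* NUMA distances seen from node 0, beyond the local one: 12, 20, 22 */
	static const int level_dist[] = { 12, 20, 22 };
	int lvl, node;

	for (lvl = 0; lvl < 3; lvl++) {
		unsigned int dom = numa_span(0, level_dist[lvl]);
		unsigned int covered = 0;

		printf("domain-%d: span=", lvl + 1);
		print_mask(dom);
		printf("\n  groups:");

		/*
		 * One group per first not-yet-covered node, built from that
		 * node's child-level span; stop once the domain is covered.
		 * For domain-2 the second group spills outside the domain
		 * span; for domain-3 the groups end up with 6 and 4 CPUs.
		 */
		for (node = 0; node < NR_NODES; node++) {
			unsigned int child;

			if (!(dom & NODE_CPUS(node)) ||
			    (covered & NODE_CPUS(node)))
				continue;

			child = lvl ? numa_span(node, level_dist[lvl - 1])
				    : NODE_CPUS(node);
			printf(" ");
			print_mask(child);

			covered |= child;
			if ((covered & dom) == dom)
				break;
		}
		printf("\n");
	}
	return 0;
}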


>
> The reason I did ask is because when you get one large and a bunch of
> smaller groups, the load-balancing 'pull' is relatively smaller to the
> large groups.
>
> That is, IIRC should_we_balance() ensures only 1 CPU out of the group
> continues the load-balancing pass. So if, for example, we have one group
> of 4 CPUs and one group of 2 CPUs, then the group of 2 CPUs will pull
> 1/2 times, while the group of 4 CPUs will pull 1/4 times.
>
> By making sure all groups are of the same level, and thus of equal size,
> this doesn't happen.

As you can see, even if we make all of domain-2's groups equal in size
by breaking up both the local group and the remote groups, we still hit
the same unequal-size problem in domain-3. What makes it trickier is
that domain-3 does not even trigger the "groups don't span domain->span"
error, so it seems we would have to rework both domain-2 and domain-3,
even though domain-3 reports no error at all.
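
To make that concrete, this is roughly what the layout would become if, as
you suggested, domain-2's local group were also broken up into per-node
groups (a sketch based on the topology above, not an actual run):

domain-2 (span 0-5): groups { 0-1 } { 2-3 } { 4-5 }  <- three equal groups
domain-3 (span 0-7): groups { 0-5 } { 4-7 }          <- still 6 CPUs vs 4 CPUs

domain-3's groups are still built from the domain-2 spans of CPU0 and CPU6,
i.e. 0-5 and 4-7, so they stay unequal unless domain-3's group construction
is changed as well.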

Thanks
Barry