RE: [PATCH] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

From: Song Bao Hua (Barry Song)
Date: Wed Feb 03 2021 - 05:29:02 EST




> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Wednesday, February 3, 2021 11:18 PM
> To: 'Valentin Schneider' <valentin.schneider@xxxxxxx>;
> vincent.guittot@xxxxxxxxxx; mgorman@xxxxxxx; mingo@xxxxxxxxxx;
> peterz@xxxxxxxxxxxxx; dietmar.eggemann@xxxxxxx; morten.rasmussen@xxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx
> Cc: linuxarm@xxxxxxxxxxxxx; xuwei (O) <xuwei5@xxxxxxxxxx>; Liguozhu (Kenneth)
> <liguozhu@xxxxxxxxxxxxx>; tiantao (H) <tiantao6@xxxxxxxxxxxxx>; wanghuiqiang
> <wanghuiqiang@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Jonathan
> Cameron <jonathan.cameron@xxxxxxxxxx>; guodong.xu@xxxxxxxxxx; Meelis Roos
> <mroos@xxxxxxxx>
> Subject: RE: [PATCH] sched/topology: fix the issue groups don't span
> domain->span for NUMA diameter > 2
>
>
>
> > -----Original Message-----
> > From: Valentin Schneider [mailto:valentin.schneider@xxxxxxx]
> > Sent: Wednesday, February 3, 2021 4:17 AM
> > To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>;
> > vincent.guittot@xxxxxxxxxx; mgorman@xxxxxxx; mingo@xxxxxxxxxx;
> > peterz@xxxxxxxxxxxxx; dietmar.eggemann@xxxxxxx; morten.rasmussen@xxxxxxx;
> > linux-kernel@xxxxxxxxxxxxxxx
> > Cc: linuxarm@xxxxxxxxxxxxx; xuwei (O) <xuwei5@xxxxxxxxxx>; Liguozhu
> (Kenneth)
> > <liguozhu@xxxxxxxxxxxxx>; tiantao (H) <tiantao6@xxxxxxxxxxxxx>;
> wanghuiqiang
> > <wanghuiqiang@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Jonathan
> > Cameron <jonathan.cameron@xxxxxxxxxx>; guodong.xu@xxxxxxxxxx; Song Bao Hua
> > (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>; Meelis Roos <mroos@xxxxxxxx>
> > Subject: Re: [PATCH] sched/topology: fix the issue groups don't span
> > domain->span for NUMA diameter > 2
> >
> > On 01/02/21 16:38, Barry Song wrote:
> > > @@ -964,6 +941,12 @@ static void init_overlap_sched_group(struct
> sched_domain
> > *sd,
> > >
> > > build_balance_mask(sd, sg, mask);
> > > cpu = cpumask_first_and(sched_group_span(sg), mask);
> > > + /*
> > > + * for the group generated by grandchild, use the sgc of 2nd cpu
> > > + * because the 1st cpu might be used by another sched_group
> > > + */
> > > + if (from_grandchild && cpumask_weight(mask) > 1)
> > > + cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
> > >
> > > sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
> >
> > So you are getting a (hopefully) unique ID for this group span at this
> > given topology level (i.e. sd->private) but as I had stated in that list of
> > issues, this creates an sgc that isn't attached to the local group of any
> > sched_domain, and thus won't get its capacity values updated.
> >
> > This can actually be seen via the capacity values you're getting at build
> > time:
> >
> > > [ 0.868907] CPU0 attaching sched-domain(s):
> > ...
> > > [ 0.869542] domain-2: span=0-5 level=NUMA
> > > [ 0.869559] groups: 0:{ span=0-3 cap=4002 }, 5:{ span=4-5 cap=2048 }
> > ^^^^^^^^^^^^^^^^
> > > [ 0.871177] CPU4 attaching sched-domain(s):
> > ...
> > > [ 0.871200] groups: 4:{ span=4 cap=977 }, 5:{ span=5 cap=1001 }
> > > [ 0.871243] domain-1: span=4-7 level=NUMA
> > > [ 0.871257] groups: 4:{ span=4-5 cap=1978 }, 6:{ span=6-7 cap=1968 }
> > ^^^^^^^^^^^^^^^^
> >
>
> Yes. I could see this issue. We could hack update_group_capacity to let
> it scan both local_group and sched_group generated by grandchild, but it
> seems your edit is much better.
>
> > IMO what we want to do here is to hook this CPU0-domain-2-group5 to the sgc
> > of CPU4-domain1-group4. I've done that in the below diff - this gives us
> > groups with sgc's owned at lower topology levels, but this will only ever
> > be true for non-local groups. This has the added benefit of working with
> > single-CPU nodes. Briefly tested on your topology and the sunfire's (via
> > QEMU), and I didn't get screamed at.
> >
> > Before the fun police comes and impounds my keyboard, I'd like to point out
> > that we could leverage this cross-level sgc referencing hack to further
> > change the NUMA domains and pretty much get rid of overlapping groups
> > (that's what I was fumbling with in [1]).
> >
> > [1]: http://lore.kernel.org/r/jhjwnw11ak2.mognet@xxxxxxx
> >
> > That is, rather than building overlapping groups and fixing them whenever
> > that breaks (distance > 2), we could have:
> > - the local group being the child domain's span (as always)
> > - all non-local NUMA groups spanning a single node each, with the right sgc
> > cross-referencing.
> >
> > Thoughts?
>
> I guess the original purpose of overlapping groups is creating as few groups
> as possible. If we totally remove overlapping groups, it seems we will create
> much more groups?
> For example, while node0 begins to build sched_domain for distance 20, it will
> add node2, since the distance between node2 and node3 is 15, so while node2
> is
> added, node3 is also added as node2's lower domain has covered node3. So we
> need
> two groups only for node0's sched_domain of distance level 20.
> +-------+ +--------+
> | | 15 | |
> | node0+----------------+ | node1 |
> | | | |
> +----+--+ XXX--------+
> | XXX
> | XX
> 20 | 15 XX
> | XXX
> | X XXX
> +----+----XXX +-------+
> | | 15 | node3|
> | node2 +-----------------+ |
> | | +-------+
> +---------+
>

Sorry for missing a line:

node0-node3: 20

20
X XX X X X X X X
XXXX X X X XX
XX XXX
XX X
XX XX
+-----X-+ +--------+ XX
| | 15 | | X
| node0+----------------+ | node1 | X
| | | | X
+----+--+ XXX--------+ X
| XXX XX
| XX XX
20 | 15 XX XXXX
| XXX XXXX
| X XXX XXXX
+----+----XXX +-------+ XXXX
| | 15 | node3|XX
| node2 +-----------------+ |
| | +-------+
+---------+

> If we remove overlapping group, we will add a group for node2, another
> group for node3. Then we get three groups.
>
> I am not sure if it is always positive for performance.
>
> >
> > --->8---
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index b748999c9e11..ef43abb6b1fb 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -932,21 +932,15 @@ build_group_from_child_sched_domain(struct
> sched_domain
> > *sd, int cpu)
> >
> > static void init_overlap_sched_group(struct sched_domain *sd,
> > struct sched_group *sg,
> > - int from_grandchild)
> > + struct sched_domain *grandchild)
> > {
> > struct cpumask *mask = sched_domains_tmpmask2;
> > - struct sd_data *sdd = sd->private;
> > + struct sd_data *sdd = grandchild ? grandchild->private : sd->private;
> > struct cpumask *sg_span;
> > int cpu;
> >
> > build_balance_mask(sd, sg, mask);
> > cpu = cpumask_first_and(sched_group_span(sg), mask);
> > - /*
> > - * for the group generated by grandchild, use the sgc of 2nd cpu
> > - * because the 1st cpu might be used by another sched_group
> > - */
> > - if (from_grandchild && cpumask_weight(mask) > 1)
> > - cpu = cpumask_next_and(cpu, sched_group_span(sg), mask);
> >
> > sg->sgc = *per_cpu_ptr(sdd->sgc, cpu);
> > if (atomic_inc_return(&sg->sgc->ref) == 1)
> > @@ -979,7 +973,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int
> > cpu)
> >
> > for_each_cpu_wrap(i, span, cpu) {
> > struct cpumask *sg_span;
> > - int from_grandchild = 0;
> > + bool from_grandchild = false;
> >
> > if (cpumask_test_cpu(i, covered))
> > continue;
> > @@ -1033,7 +1027,7 @@ build_overlap_sched_groups(struct sched_domain *sd,
> int
> > cpu)
> > !cpumask_subset(sched_domain_span(sibling->child),
> > span)) {
> > sibling = sibling->child;
> > - from_grandchild = 1;
> > + from_grandchild = true;
> > }
> >
> > sg = build_group_from_child_sched_domain(sibling, cpu);
> > @@ -1043,7 +1037,7 @@ build_overlap_sched_groups(struct sched_domain *sd,
> int
> > cpu)
> > sg_span = sched_group_span(sg);
> > cpumask_or(covered, covered, sg_span);
> >
> > - init_overlap_sched_group(sd, sg, from_grandchild);
> > + init_overlap_sched_group(sd, sg, from_grandchild ? sibling : NULL);
> >
> Nice edit!
> Will merge your edit into v1 and send v2.
>
> > if (!first)
> > first = sg;
>
Thanks
Barry