Re: [PATCH v3] sched/topology: fix the issue groups don't span domain->span for NUMA diameter > 2

From: Valentin Schneider
Date: Tue Feb 09 2021 - 06:31:00 EST


On 09/02/21 21:21, Barry Song wrote:
> Reported-by: Valentin Schneider <valentin.schneider@xxxxxxx>
> Tested-by: Meelis Roos <mroos@xxxxxxxx>
> Signed-off-by: Barry Song <song.bao.hua@xxxxxxxxxxxxx>

Tested on a bunch of NUMA topologies via QEMU (AMD Epyc, D06, Sunfire) and
the results look sane.

Small comment nits below, but regardless:

Reviewed-by: Valentin Schneider <valentin.schneider@xxxxxxx>

> ---
> -v3:
> Mainly updated according to Valentin's comments. While the approach was
> started by me, Valentin contributed the most useful edit and comments.
> Thanks, Valentin!
>

Happy to help, thanks for fixing this!

> * fixed a potential issue that re-used sgc might be located in
> a sched_domain which will be degenrated;
> * code cleanup to make it more readable
>
> While Valentin started another approach which completely removed overlapped
> sched_group, we both agree that it is better to have a solution which won't
> touch machines without 3-hops issue first:
> https://lore.kernel.org/lkml/jhjpn1a232z.mognet@xxxxxxx/
>

I think this doesn't scale properly so I doubt it's ever going to see the
light of day, but I had to write it to convince myself of it :/

> kernel/sched/topology.c | 91 +++++++++++++++++++++++++++--------------
> 1 file changed, 61 insertions(+), 30 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5d3675c7a76b..ab5ebf17f30a 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -982,6 +953,31 @@ static void init_overlap_sched_group(struct sched_domain *sd,
> sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
> }
>
> +static struct sched_domain *
> +find_descended_sibling(struct sched_domain *sd, struct sched_domain *sibling)
> +{
> + /*
> + * The proper descendant would be the one whose child won't span out
> + * of sd
> + */
> + while (sibling->child &&
> + !cpumask_subset(sched_domain_span(sibling->child),
> + sched_domain_span(sd)))
> + sibling = sibling->child;
> +
> + /*
> + * As we are referencing sgc across different topology level, we need
> + * to go down to skip those sched_domains which don't contribute to
> + * scheduling because they will be degenerated in cpu_attach_domain

+ * This is important because we must make sure we point to an sgc that
+ * will be updated via update_group_capacity().

> + */
> + while (sibling->child &&
> + cpumask_equal(sched_domain_span(sibling->child),
> + sched_domain_span(sibling)))
> + sibling = sibling->child;
> +
> + return sibling;
> +}
> +
> static int
> build_overlap_sched_groups(struct sched_domain *sd, int cpu)
> {
> @@ -1015,6 +1011,41 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
> if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
> continue;
>
> + /*
> + * Usually we build sched_group by sibling's child sched_domain
> + * But for machines whose NUMA diameter are 3 or above, we move
> + * to build sched_group by sibling's proper descendant's child
> + * domain because sibling's child sched_domain will span out of
> + * the sched_domain being built as below.
> + *
> + * Smallest diameter=3 topology is:
> + *
> + * node 0 1 2 3
> + * 0: 10 20 30 40
> + * 1: 20 10 20 30
> + * 2: 30 20 10 20
> + * 3: 40 30 20 10
> + *
> + * 0 --- 1 --- 2 --- 3
> + *
> + * NUMA-3 0-3 N/A N/A 0-3
> + * groups: {0-2},{1-3} {1-3},{0-2}
> + *
> + * NUMA-2 0-2 0-3 0-3 1-3
> + * groups: {0-1},{1-3} {0-2},{2-3} {1-3},{0-1} {2-3},{0-2}
> + *
> + * NUMA-1 0-1 0-2 1-3 2-3
> + * groups: {0},{1} {1},{2},{0} {2},{3},{1} {3},{2}
> + *
> + * NUMA-0 0 1 2 3
> + *
> + * The NUMA-2 groups for nodes 0 and 3 are obviously buggered, as the
^^^
s/are/would be/

> + * group span isn't a subset of the domain span.
^^^^^
s/isn't/wouldn't be/

> + */
> + if (sibling->child &&
> + !cpumask_subset(sched_domain_span(sibling->child), span))
> + sibling = find_descended_sibling(sd, sibling);
> +
> sg = build_group_from_child_sched_domain(sibling, cpu);
> if (!sg)
> goto fail;
> @@ -1022,7 +1053,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
> sg_span = sched_group_span(sg);
> cpumask_or(covered, covered, sg_span);
>
> - init_overlap_sched_group(sd, sg);
> + init_overlap_sched_group(sibling, sg);
>
> if (!first)
> first = sg;
> --
> 2.25.1