Re: [PATCH] sched/topology: improve topology_span_sane speed

From: Steve Wahl
Date: Tue Oct 15 2024 - 18:20:42 EST


On Tue, Oct 15, 2024 at 03:50:49PM +0530, K Prateek Nayak wrote:
> Hello Steve,
>
> On 10/10/2024 9:21 PM, Steve Wahl wrote:
> > Use a different approach to topology_span_sane(), that checks for the
> > same constraint of no partial overlaps for any two CPU sets for
> > non-NUMA topology levels, but does so in a way that is O(N) rather
> > than O(N^2).
> >
> > Instead of comparing with all other masks to detect collisions, keep
> > one mask that includes all CPUs seen so far and detect collisions with
> > a single cpumask_intersects test.
> >
> > If the current mask has no collisions with previously seen masks, it
> > should be a new mask, which can be uniquely identified by the lowest
> > bit set in this mask. Keep a pointer to this mask for future
> > reference (in an array indexed by the lowest bit set), and add the
> > CPUs in this mask to the list of those seen.
> >
> > If the current mask does collide with previously seen masks, it should
> > be exactly equal to a mask seen before, looked up in the same array
> > indexed by the lowest bit set in the mask, a single comparison.
> >
> > Move the topology_span_sane() check out of the existing topology level
> > loop, let it use its own loop so that the array allocation can be done
> > only once, shared across levels.
> >
> > On a system with 1920 processors (16 sockets, 60 cores, 2 threads),
> > the average time to take one processor offline is reduced from 2.18
> > seconds to 1.01 seconds. (Off-lining 959 of 1920 processors took
> > 34m49.765s without this change, 16m10.038s with this change in place.)
> >
> > Signed-off-by: Steve Wahl <steve.wahl@xxxxxxx>
> > ---
> > kernel/sched/topology.c | 79 ++++++++++++++++++++++++++++-------------
> > 1 file changed, 54 insertions(+), 25 deletions(-)
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 9748a4c8d668..c150074033d3 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2356,36 +2356,65 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
> > /*
> > * Ensure topology masks are sane, i.e. there are no conflicts (overlaps) for
> > - * any two given CPUs at this (non-NUMA) topology level.
> > + * any two given CPUs on non-NUMA topology levels.
> > */
> > -static bool topology_span_sane(struct sched_domain_topology_level *tl,
> > - const struct cpumask *cpu_map, int cpu)
> > +static bool topology_span_sane(const struct cpumask *cpu_map)
> > {
> > - int i = cpu + 1;
> > + struct sched_domain_topology_level *tl;
> > + const struct cpumask **masks;
> > + struct cpumask *covered;
> > + int cpu, id;
> > + bool ret = false;
> > - /* NUMA levels are allowed to overlap */
> > - if (tl->flags & SDTL_OVERLAP)
> > - return true;
> > + lockdep_assert_held(&sched_domains_mutex);
> > + covered = sched_domains_tmpmask;
> > +
> > + masks = kmalloc_array(num_possible_cpus(), sizeof(struct cpumask *), GFP_KERNEL);
> > + if (!masks)
> > + return ret;
>
> That looks like a very large array that seems unnecessary. Instead, is
> it possible to use "tl->mask(id)" down blow to check for equality? (I'll
> elaborate more below)
>
> > +
> > + for_each_sd_topology(tl) {
> > +
> > + /* NUMA levels are allowed to overlap */
> > + if (tl->flags & SDTL_OVERLAP)
> > + continue;
> > +
> > + cpumask_clear(covered);
> > + memset(masks, 0, num_possible_cpus() * sizeof(struct cpumask *));
> > - /*
> > - * Non-NUMA levels cannot partially overlap - they must be either
> > - * completely equal or completely disjoint. Otherwise we can end up
> > - * breaking the sched_group lists - i.e. a later get_group() pass
> > - * breaks the linking done for an earlier span.
> > - */
> > - for_each_cpu_from(i, cpu_map) {
> > /*
> > - * We should 'and' all those masks with 'cpu_map' to exactly
> > - * match the topology we're about to build, but that can only
> > - * remove CPUs, which only lessens our ability to detect
> > - * overlaps
> > + * Non-NUMA levels cannot partially overlap - they must be either
> > + * completely equal or completely disjoint. Otherwise we can end up
> > + * breaking the sched_group lists - i.e. a later get_group() pass
> > + * breaks the linking done for an earlier span.
> > */
> > - if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
> > - cpumask_intersects(tl->mask(cpu), tl->mask(i)))
> > - return false;
> > + for_each_cpu(cpu, cpu_map) {
> > + /* lowest bit set in this mask is used as a unique id */
> > + id = cpumask_first(tl->mask(cpu));
>
> nit. "id" can be declared in this scope since it is not used anywhere
> outside. Also perhaps you can store "tl->mask(cpu)" instead of calling
> the function multiple times below. Thoughts?

I think that would work. I'll see if I can work it in.

> > +
> > + /* if this mask doesn't collide with what we've already seen */
> > + if (!cpumask_intersects(tl->mask(cpu), covered)) {
> > + /* this failing would be an error in this algorithm */
> > + if (WARN_ON(masks[id]))
> > + goto notsane;
> > +
> > + /* record the mask we saw for this id */
> > + masks[id] = tl->mask(cpu);
> > + cpumask_or(covered, tl->mask(cpu), covered);
> > + } else if ((!masks[id]) || !cpumask_equal(masks[id], tl->mask(cpu))) {
>
> Elaborating the above point, instead of caching cpumask in an array with:
>
> masks[id] = tl->mask(cpu);
>
> how about, having a single "struct cpumask *id_seen" that is cleared at
> the same time as "covered" cpumask, and a bit corresponding to every
> "id" seen is set. If there is an intersection, and this bit against the
> "id" seen is not set or if the "tl->mask(cpu)" is not same as
> "tl->mask(id)", one can detect if the range is overlapping. This
> requires much less space compared to an array of cpumasks but I'm not
> sure how much more expensive it is to access "tl->mask(id)" compared
> to reading it from the array of cpumasks.

So, I was at first skeptical that this would work, I thought it might
miss some error cases. But after a lot of thought I think I was wrong
and it will work. And it lets me get rid of the array and with it
the use of num_possible_cpus vs. nr_cpu_ids that Peter Z. commented
about.

Thanks, I will re-work and post a new version.

--> Steve

--
Steve Wahl, Hewlett Packard Enterprise