Re: [PATCH] sched,topology: Update sched topology atomically

From: ying.huang@xxxxxxxxx
Date: Mon Apr 11 2022 - 05:01:38 EST


Hi, Valentin,

Thanks a lot for review!

On Fri, 2022-04-08 at 17:52 +0100, Valentin Schneider wrote:
>
> Hi,
>
> On 06/04/22 15:01, Huang Ying wrote:
> > When Peter Zijlstra reviewed commit 0fb3978b0aac ("sched/numa: Fix
> > NUMA topology for systems with CPU-less nodes") [1], he pointed out
> > that sched_domains_numa_distance and sched_domains_numa_masks are made
> > separate RCU variables.  That could go side-ways if there were a
> > function using both, although there isn't for now.
> >
> > So we update sched_domains_numa_distance and sched_domains_numa_masks
> > and some other related sched topology parameters atomically to address
> > the potential issues.
> >
> > [1] https://lkml.kernel.org/r/20220214121553.582248-1-
> > ying.huang@xxxxxxxxx
> >
>
> I had to re-acquaint myself with [1], but AFAICT this looks mostly
> OK. Have some nits/comments below, I haven't actually tested this yet as
> I'm still setting stuff up over here...
>
> > Signed-off-by: "Huang, Ying" <ying.huang@xxxxxxxxx>
> > Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > Cc: Valentin Schneider <valentin.schneider@xxxxxxx>
> > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > Cc: Mel Gorman <mgorman@xxxxxxx>
> > Cc: Rik van Riel <riel@xxxxxxxxxxx>
> > Cc: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
> > ---
> >  kernel/sched/fair.c     |  47 ++++++---
> >  kernel/sched/sched.h    |  14 ++-
> >  kernel/sched/topology.c | 223 +++++++++++++++++++---------------------
> >  3 files changed, 151 insertions(+), 133 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d4bd299d67ab..546e7347fafc 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1294,17 +1294,22 @@ static unsigned long score_nearby_nodes(struct
> > task_struct *p, int nid,
> >                                       int lim_dist, bool task)
> >  {
> >       unsigned long score = 0;
> > -       int node, max_dist;
> > +       int node, max_dist, type;
> > +       struct sched_numa_topology *topology;
> > +
> > +       rcu_read_lock();
> > +       topology = rcu_dereference(sched_numa_topology);
> > +       type = topology->type;
> > +       max_dist = topology->max_distance;
> > +       rcu_read_unlock();
> >
>
> One thing worth pointing out, the NUMA topology type *wasn't* RCU protected
> previously, which I think we've been wanting since
>
>   0083242c9375 ("sched/topology: Skip updating masks for non-online nodes")
>
> (before then the topology was init'd once-and-for-all).
>
> Another thing; this pattern is now repeated a handful of times in fair.c,
> what about bundling it up? As a bonus it lets us keep the definition of
> struct sched_numa_topology internal to topology.c:
>
> void sched_fetch_numa_topology(int *type int *max_distance)
> {
>         struct sched_numa_topology __rcu *topology;
>
>         rcu_read_lock();
>         topology = rcu_dereference(sched_numa_topology);
>
>         if (type)
>                 *type = topology->type;
>
>         if (max_distance)
>                 *max_distance = topology->max_distance;
>
>
>         rcu_read_unlock();
> }
>

This looks good. Will use it in the next version. Thanks!

> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 810750e62118..19523b23034f 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1483,14 +1483,10 @@ static void claim_allocations(int cpu, struct
> > sched_domain *sd)
> >  }
> >
> >  #ifdef CONFIG_NUMA
> > -enum numa_topology_type sched_numa_topology_type;
> > +struct sched_numa_topology     default_numa_topology;
>
> Shouldn't this be static?

Yes. Will change it in the next version.

> > +struct sched_numa_topology     *sched_numa_topology =
> > &default_numa_topology;
>
> This should also be __rcu (sparse should complain about this)

Yes. Will change it in the next version.

> >
> > -static int                     sched_domains_numa_levels;
> >  static int                     sched_domains_curr_level;
> > -
> > -int                            sched_max_numa_distance;
> > -static int                     *sched_domains_numa_distance;
> > -static struct cpumask          ***sched_domains_numa_masks;
> >  #endif
> >
> >  /*
>
> > +static void sched_numa_topology_switch(struct sched_numa_topology
> > *topology,
> > +                                      struct sched_domain_topology_level
> > *tl)
> > +{
> > +       struct sched_numa_topology *old_topology = sched_numa_topology;
> > +       struct sched_domain_topology_level *old_tl =
> > sched_domain_topology;
> > +
> > +       rcu_assign_pointer(sched_numa_topology, topology);
> > +       sched_domain_topology = tl;
> > +
> > +       if (old_topology == &default_numa_topology)
> > +               return;
> > +
> > +       init_rcu_head(&old_topology->rcu);
>
> I'm not familiar with this and I can find only a handful of users, should
> we be doing that for struct sched_domain?

Per my understanding, this is for debug and we should call it before
call_rcu(). Copying Paul for confirmation.

>
> > +       call_rcu(&old_topology->rcu, sched_numa_topology_free_rcu);
> > +       kfree(old_tl);
> > +}
> >
> >  #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
> >
> >  void sched_init_numa(int offline_node)
> >  {
> >       struct sched_domain_topology_level *tl;
> > -       unsigned long *distance_map;
> > +       unsigned long *distance_map = NULL;
> >       int nr_levels = 0;
> >       int i, j;
> > -       int *distances;
> > -       struct cpumask ***masks;
> > +       struct sched_numa_topology *topology;
> > +
> > +       topology = kzalloc(sizeof(*topology), GFP_KERNEL);
> > +       if (!topology)
> > +               goto err;
> >
> >       /*
> >        * O(nr_nodes^2) deduplicating selection sort -- in order to find
> > the
>
> > +       return;
> > +err:
>
> I was expecting the "usual" multi-label setup, but this seems to be working
> too - though not in the most obvious way ;)

OK. Will try to convert it to multi-label.

>
> > +       bitmap_free(distance_map);
> > +       sched_numa_topology_free(topology);
> > +       sched_numa_topology_switch(&default_numa_topology,
> > +                                  sched_domain_topology_default);
> >  }
> >
> >  /*
> > @@ -1981,7 +1977,6 @@ void sched_update_numa(int cpu, bool online)
> >       if (cpumask_weight(cpumask_of_node(node)) != 1)
> >               return;
> >
> > -       sched_reset_numa();
> >       sched_init_numa(online ? NUMA_NO_NODE : node);
> >  }
> >
> > @@ -1990,14 +1985,15 @@ void sched_domains_numa_masks_set(unsigned int
> > cpu)
> >       int node = cpu_to_node(cpu);
> >       int i, j;
> >
> > -       for (i = 0; i < sched_domains_numa_levels; i++) {
> > +       for (i = 0; i < sched_numa_topology->nr_levels; i++) {
> >               for (j = 0; j < nr_node_ids; j++) {
> >                       if (!node_state(j, N_CPU))
> >                               continue;
> >
> >                       /* Set ourselves in the remote node's masks */
> > -                       if (node_distance(j, node) <=
> > sched_domains_numa_distance[i])
> > -                               cpumask_set_cpu(cpu,
> > sched_domains_numa_masks[i][j]);
> > +                       if (node_distance(j, node) <=
> > +                           sched_numa_topology->distances[i])
> > +                               cpumask_set_cpu(cpu, sched_numa_topology-
> > >masks[i][j]);
> will be 0
> I'm guessing the assumption here is that this is done in
> sched_cpu_{de,}activate(), so we have the hotplug lock and so the masks are
> stable... Should we sprinkle in a lockdep_assert_cpus_held() for good
> measure?
>
> Having a second look at this, I think the same applies to your change to
> sd_numa_mask() - those end up used in the domain rebuild which itself might
> be deferred cpuset_hotplug_work, but that should still hold the hotplug
> lock...

Yes. hotplug lock needs to be held in these places. Will add
lockdep_assert_cpus_held() in necessary places.

> >               }
> >       }
> >  }
> > @@ -2006,10 +2002,10 @@ void sched_domains_numa_masks_clear(unsigned int
> > cpu)
> >  {
> >       int i, j;
> >
> > -       for (i = 0; i < sched_domains_numa_levels; i++) {
> > +       for (i = 0; i < sched_numa_topology->nr_levels; i++) {
> >               for (j = 0; j < nr_node_ids; j++) {
> > -                       if (sched_domains_numa_masks[i][j])
> > -                               cpumask_clear_cpu(cpu,
> > sched_domains_numa_masks[i][j]);
> > +                       if (sched_numa_topology->masks[i][j])
> > +                               cpumask_clear_cpu(cpu,
> > sched_numa_topology->masks[i][j]);
> >               }
> >       }
> >  }
> > @@ -2025,22 +2021,19 @@ void sched_domains_numa_masks_clear(unsigned int
> > cpu)
> >  int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
> >  {
> >       int i, j = cpu_to_node(cpu), found = nr_cpu_ids;
> > -       struct cpumask ***masks;
> > +       struct sched_numa_topology *topology;
> >
> >       rcu_read_lock();
> > -       masks = rcu_dereference(sched_domains_numa_masks);
> > -       if (!masks)
> > -               goto unlock;
> > -       for (i = 0; i < sched_domains_numa_levels; i++) {
> > -               if (!masks[i][j])
> > +       topology = rcu_dereference(sched_numa_topology);
> > +       for (i = 0; i < topology->nr_levels; i++) {
> > +               if (!topology->masks[i][j])
>
> If I got default_numa_topology right and it's meant to be static,
> sched_numa_topology->masks can still be NULL, but here this would be gated
> by nr_levels being 0... Did I get that right?

Yes. nr_levels in default_numa_topology is 0.

>
> >                       break;
> > -               cpu = cpumask_any_and(cpus, masks[i][j]);
> > +               cpu = cpumask_any_and(cpus, topology->masks[i][j]);
> >               if (cpu < nr_cpu_ids) {
> >                       found = cpu;
> >                       break;
> >               }
> >       }
> > -unlock:
> >       rcu_read_unlock();
> >
> >       return found;
> > --
> > 2.30.2
>

Best Regards,
Huang, Ying