Re: [PATCH 1/5] sched/topology: Introduce for_each_numa_node() iterator
From: Andrea Righi
Date: Fri Feb 07 2025 - 10:46:09 EST
Hi Yury,
On Thu, Feb 06, 2025 at 10:57:19PM -0500, Yury Norov wrote:
> On Thu, Feb 06, 2025 at 09:15:31PM +0100, Andrea Righi wrote:
...
> > @@ -261,6 +267,29 @@ sched_numa_hop_mask(unsigned int node, unsigned int hops)
> > }
> > #endif /* CONFIG_NUMA */
> >
> > +/**
> > + * for_each_numa_node - iterate over NUMA nodes at increasing hop distances
> > + * from a given starting node.
> > + * @node: the iteration variable, representing the current NUMA node.
> > + * @start: the NUMA node to start the iteration from.
> > + * @visited: a nodemask_t to track the visited nodes.
>
> nit: s/nodemask_t/nodemask
The type is actually nodemask_t, do you think it's better to mention
nodemask instead?
>
> > + * @state: state of NUMA nodes to iterate.
> > + *
> > + * This macro iterates over NUMA nodes in increasing distance from
> > + * @start_node and yields MAX_NUMNODES when all the nodes have been
> > + * visited.
> > + *
> > + * The difference between for_each_node() and for_each_numa_node() is that
> > + * the former allows to iterate over nodes in no particular order, whereas
> > + * the latter iterates over nodes in increasing order of distance.
>
> for_each_node iterates them in numerical order.
Oh yes, much better. :)
>
> > + *
> > + * Requires rcu_lock to be held.
> > + */
>
> Please mention complexity here, which is O(N^2).
Ok. Will also add a comment to describe why it's O(N^2).
>
> > +#define for_each_numa_node(node, start, visited, state) \
> > + for (node = start; \
> > + node != MAX_NUMNODES; \
> > + node = sched_numa_node(&(visited), start, state))
>
> Please braces around parameters.
Ok.
>
> 'node < MAX_NUMNODES' is better. It will cost you nothing but will give
> some guarantee that if user passes broken start value, we don't call
> sched_numa_node().
Good point.
>
> What about:
> start = -EINVAL;
> foe_each_numa_node(node, start, visited, N_ONLINE)
> do_something(node);
>
> Whatever garbage user puts in 'start', at the very first iteration,
> do_something() will be executed against it. Offline node, -EINVAL or
> NUMA_NO_NODE are the examples.
So, my idea was actually to use start == NUMA_NO_NODE for all the cases
where the starting node doesn't matter, for example when calling
scx_bpf_pick_idle_cpu(), that doesn't have a prev_cpu context.
Should we implicitly fallback to for_each_node() when
start == NUMA_NO_NODE? And if it's complete garbage maybe just break and
never execute do_something(node)?
>
> > +
> > /**
> > * for_each_numa_hop_mask - iterate over cpumasks of increasing NUMA distance
> > * from a given node.
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index da33ec9e94ab2..e1d0a33415fb5 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2183,6 +2183,48 @@ int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> > }
> > EXPORT_SYMBOL_GPL(sched_numa_find_nth_cpu);
> >
> > +/**
> > + * sched_numa_node - Find the NUMA node at the closest distance from
> > + * node @start.
> > + *
> > + * @visited: a pointer to a nodemask_t representing the visited nodes.
> > + * @start: the node to start the search from.
>
> Maybe just 'node' The 'start' is only meaningful in the caller. Here
> we don't start, we just look for a node that is the most nearest to a
> given one.
Ok.
>
> What if NOMA_NO_NODE is passed?
We could return the first non-visited node in numerical order. And still
fallback to for_each_node() from for_each_numa_node() when
start == NUMA_NO_NODE.
>
> > + * @state: the node state to filter nodes by.
> > + *
> > + * This function iterates over all nodes in the given state and calculates
> > + * the distance to the starting node. It returns the node that is the
> > + * closest in terms of distance that has not already been considered (not
> > + * set in @visited and not the starting node). If the node is found, it is
> > + * marked as visited in the @visited node mask.
> > + *
> > + * Returns the node ID closest in terms of hop distance from the @start
> > + * node, or MAX_NUMNODES if no node is found (or all nodes have been
> > + * visited).
> > + */
> > +int sched_numa_node(nodemask_t *visited, int start, unsigned int state)
>
> The name is somewhat confusing. Sched_numa_node what? If you're looking
> for a closest node, call your function find_closest_node().
>
> We already have numa_nearest_node(). The difference between this one
> and what you're implementing here is that you add an additional mask.
> So, I'd suggest something like
>
> int numa_nearest_node_andnot(int node, unsigned int state, nodemask_t *mask)
>
> Also, there's about scheduler her, so I'd suggest to place it next to
> numa_nearest_node() in mm/mempolicy.c.
Makes sense, and I agree that mm/mempolicy.c is a better place for this.
>
> > +{
> > + int dist, n, min_node, min_dist;
> > +
> > + min_node = MAX_NUMNODES;
> > + min_dist = INT_MAX;
> > +
> > + /* Find the nearest unvisted node */
>
> If you name it properly, you don't need to explain your intentions in
> the code. Also, at this level of abstraction, the 'visited' name may
> be too specific. Let's refer to it as just a mask containing nodes
> that user wants to skip for whatever reason.
Ok.
>
>
> > + for_each_node_state(n, state) {
> > + if (n == start || node_isset(n, *visited))
> > + continue;
>
> Nah.
>
> 1. Skipping start node is wrong. The very first call should return
> 'start' exactly. Then it will be masked out, so the traversing will
> move forward.
> 2. This should be for_each_node_state_andnot(n, state, mask).
>
> This way you'll be able to drop the above condition entirely.
Yeah, I agree, I'll revisit this, also considering the comments above.
>
> > + dist = node_distance(start, n);
> > + if (dist < min_dist) {
> > + min_dist = dist;
> > + min_node = n;
> > + }
> > + }
> > + if (min_node != MAX_NUMNODES)
> > + node_set(min_node, *visited);
>
> Is it possible to set the 'visited' bit in caller code? This way we'll
> make the function pure, like other bitmap search functions.
>
> Would something like this work?
>
> int numa_nearest_node_andnot(int node, unsigned int state, const nodemask_t *mask);
> #define for_each_numa_node(node, visited, state) \
> for (int start = (node), n = numa_nearest_node_andnot(start, (state), &(visited)); \
> n < MAX_NUMNODES; \
> node_set(n, (visited)), n = numa_nearest_node_andnot(start, (state), &(visited)))
>
> One other option to think about is that we can introduce numa_nearest_nodemask()
> and use it like this
>
> nodemask_t nodes;
>
> nodes_complement(nodes, node_states[N_ONLINE];
> for_each_numa_node(node, unvisited)
> do_something();
>
> Where:
>
> int numa_nearest_nodemask(int node, const nodemask_t *mask);
> #define for_each_numa_node(node, unvisited) \
> for (int start = (node), n = numa_nearest_nodemask(start, &(unvisited)); \
> n < MAX_NUMNODES; \
> node_clear(n, (visited)), n = numa_nearest_nodemask(start, &(visited)))
I like the numa_nearest_nodemask() idea, I'll do some experiemnts with it.
Thanks!
-Andrea