Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks

From: Andrea Righi
Date: Fri Dec 20 2024 - 13:27:29 EST


On Fri, Dec 20, 2024 at 08:48:55AM -0800, Yury Norov wrote:
> On Tue, Dec 17, 2024 at 10:32:28AM +0100, Andrea Righi wrote:
> > Using a single global idle mask can lead to inefficiencies and a lot of
> > stress on the cache coherency protocol on large systems with multiple
> > NUMA nodes, since all the CPUs can create a really intense read/write
> > activity on the single global cpumask.
> >
> > Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> > to improve scalability and performance on large systems.
> >
> > The concept is that each cpumask will track only the idle CPUs within
> > its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> > In this way concurrent access to the idle cpumask will be restricted
> > within each NUMA node.
> >
> > NOTE: with per-node idle cpumasks enabled scx_bpf_get_idle_cpu/smtmask()
> > are returning the cpumask of the current NUMA node, instead of a single
> > cpumask for all the CPUs.
> >
> > Signed-off-by: Andrea Righi <arighi@xxxxxxxxxx>
> > ---
> > kernel/sched/ext.c | 281 +++++++++++++++++++++++++++++++++------------
> > 1 file changed, 209 insertions(+), 72 deletions(-)
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index a17abd2df4d4..d4666db4a212 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -894,6 +894,7 @@ static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
> > #ifdef CONFIG_SMP
> > static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
> > static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
> > +static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_per_node);
> > #endif
> >
> > static struct static_key_false scx_has_op[SCX_OPI_END] =
> > @@ -937,11 +938,82 @@ static struct delayed_work scx_watchdog_work;
> > #define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
> > #endif
> >
> > -static struct {
> > +struct idle_cpumask {
> > cpumask_var_t cpu;
> > cpumask_var_t smt;
> > -} idle_masks CL_ALIGNED_IF_ONSTACK;
> > +};
> > +
> > +/*
> > + * Make sure a NUMA node is in a valid range.
> > + */
> > +static int validate_node(int node)
> > +{
> > + /* If no node is specified, return the current one */
> > + if (node == NUMA_NO_NODE)
> > + return numa_node_id();
> > +
> > + /* Make sure node is in the range of possible nodes */
> > + if (node < 0 || node >= num_possible_nodes())
> > + return -EINVAL;
> > +
> > + return node;
> > +}
>
> This is needed in BPF code, right? Kernel users should always provide
> correct parameters.

This should be used only in the kfuncs, I should probably move this down to
the BPF helpers section. For internal kernel functions we may want to
WARN_ON_ONCE().

>
> > +/*
> > + * cpumasks to track idle CPUs within each NUMA node.
> > + *
> > + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
> > + * from node 0 is used to track all idle CPUs system-wide.
> > + */
> > +static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;
> > +
> > +static struct cpumask *get_idle_mask_node(int node, bool smt)
>
> Like Rasmus said in another thread, this 'get' prefix looks weird.
>
> > +/**
> > + * get_parity8 - get the parity of an u8 value
>
> I know it's prototypical bikeshedding, but what's with the "get_"
> prefix? Certainly the purpose of any pure function is to "get" the
> result of some computation. We don't have "get_strlen()".
>
> Please either just "parity8", or if you want to emphasize that this
> belongs in the bit-munging family, "bit_parity8".
>
> So maybe idle_nodes(), idle_smt_nodes() and so on?

It's a bit different here, because in theory we get a "reference" to the
object (even if we're not refcounting it), it's not a pure function.

But I'm also totally fine to get rid of the "get_" prefix.

>
> > +{
> > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > + return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
> > +
> > + node = validate_node(node);
> > + if (node < 0)
> > + return cpu_none_mask;
>
> cpu_none_mask is a valid pointer. How should user distinguish invalid
> parameter and empty nodemask? You should return -EINVAL.
>
> Anyways...
>
> This would make a horrid mess. Imagine I'm your user, and I'm
> experimenting with this idle feature. I decided to begin with
> per-node idle masks disabled and call it for example like:
>
> get_idle_mask_node(random(), true)
>
> And it works! I get all my idle CPUs just well. So I'm happily build
> my complex and huge codebase on top of it.
>
> Then one day I decide to enable those fancy per-node idle masks
> because you said they unload interconnect and whatever... At that
> point what I want is to click a button just to collect some perf
> numbers. So I do that, and I find all my codebase coredumped in some
> random place... Because per-node idle masks basic functions
> simply have different interface: they disallow node = random() as
> parameter, while global idle mask is OK with it.

So, the idea here is to use validate_node() to catch incorrect nodes
provided by the user (scx scheduler) and trigger an scx_ops_error(), which
will force the scheduler to exit with an error.

In this context cpu_none_mask serves as a "temporary harmless value" that
can be returned to the caller, that will just exit.

Again, we could add a sanity check for the kernel code here as well, like a
WARN_ON_ONCE() to catch potential incorrect values that might be used by
other internal kernel functions (and still return cpu_none_mask).

>
> > +
> > + return smt ? idle_masks[node]->smt : idle_masks[node]->cpu;
> > +}
> > +
> > +static struct cpumask *get_idle_cpumask_node(int node)
>
> inline or __always_inline?

Ok.

> > +
> > +static s32
> > +scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
>
> We have a sort of convention here. If you pick a cpu based on NUMA
> distances, the 'numa' should be a 2nd prefix after a subsystem where
> it's used. Refer for example:
>
> sched_numa_find_nth_cpu()
> sched_numa_hop_mask()
> for_each_numa_hop_mask()
>
> So I'd name it scx_numa_idle_cpu()

Ah, that's good to know, thanks!

(the _numa variant has been removed in v8, but it's still good to know)

>
> > +{
> > + nodemask_t hop_nodes = NODE_MASK_NONE;
> > + int start_node = cpu_to_node(prev_cpu);
> > + s32 cpu = -EBUSY;
> > +
> > + /*
> > + * Traverse all online nodes in order of increasing distance,
> > + * starting from prev_cpu's node.
> > + */
> > + rcu_read_lock();
>
> RCU is a leftover from for_each_numa_hop_mask(), I guess?

Correct! Already removed in the next version.

>
> > + for_each_numa_hop_node(node, start_node, hop_nodes, N_ONLINE) {
> > + cpu = scx_pick_idle_cpu_from_node(node, cpus_allowed, flags);
> > + if (cpu >= 0)
> > + break;
> > + }
> > + rcu_read_unlock();
> > +
> > + return cpu;
> > +}
> > +
> > +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> > +{
> > + /*
> > + * Only node 0 is used if per-node idle cpumasks are disabled.
> > + */
> > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> > + return scx_pick_idle_cpu_from_node(0, cpus_allowed, flags);
> > +
> > + return scx_pick_idle_cpu_numa(cpus_allowed, prev_cpu, flags);
> > }
> >
> > /*
> > @@ -3339,7 +3453,7 @@ static bool llc_numa_mismatch(void)
> > * CPU belongs to a single LLC domain, and that each LLC domain is entirely
> > * contained within a single NUMA node.
> > */
> > -static void update_selcpu_topology(void)
> > +static void update_selcpu_topology(struct sched_ext_ops *ops)
> > {
> > bool enable_llc = false, enable_numa = false;
> > unsigned int nr_cpus;
> > @@ -3360,6 +3474,14 @@ static void update_selcpu_topology(void)
> > if (nr_cpus > 0) {
> > if (nr_cpus < num_online_cpus())
> > enable_llc = true;
> > + /*
> > + * No need to enable LLC optimization if the LLC domains are
> > + * perfectly overlapping with the NUMA domains when per-node
> > + * cpumasks are enabled.
> > + */
> > + if ((ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE) &&
> > + !llc_numa_mismatch())
> > + enable_llc = false;
> > pr_debug("sched_ext: LLC=%*pb weight=%u\n",
> > cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
> > }
> > @@ -3395,6 +3517,14 @@ static void update_selcpu_topology(void)
> > static_branch_enable_cpuslocked(&scx_selcpu_topo_numa);
> > else
> > static_branch_disable_cpuslocked(&scx_selcpu_topo_numa);
> > +
> > + /*
> > + * Check if we need to enable per-node cpumasks.
> > + */
> > + if (ops->flags & SCX_OPS_BUILTIN_IDLE_PER_NODE)
> > + static_branch_enable_cpuslocked(&scx_builtin_idle_per_node);
> > + else
> > + static_branch_disable_cpuslocked(&scx_builtin_idle_per_node);
> > }
> >
> > /*
> > @@ -3415,6 +3545,8 @@ static void update_selcpu_topology(void)
> > * 4. Pick a CPU within the same NUMA node, if enabled:
> > * - choose a CPU from the same NUMA node to reduce memory access latency.
> > *
> > + * 5. Pick any idle CPU usable by the task.
> > + *
>
> I would make it a separate patch, or even send it independently. It
> doesn't look related to the series...

Right, it's a separate preparation patch in the next version. But I'll
probably send this as a totally separate patch.

> > static void reset_idle_masks(void)
> > {
> > + int node;
> > +
> > + if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> > + cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
> > + cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
> > + return;
> > + }
> > +
> > /*
> > * Consider all online cpus idle. Should converge to the actual state
> > * quickly.
> > */
> > - cpumask_copy(idle_masks.cpu, cpu_online_mask);
> > - cpumask_copy(idle_masks.smt, cpu_online_mask);
> > + for_each_node_state(node, N_POSSIBLE) {
>
> for_each_node(node)
>
> If the comment above is still valid, you don't need to visit every
> possible node. You need to visit only N_ONLINE, or even N_CPU nodes.
>
> This adds to the discussion in patch #1. Taking care of CPU-less nodes
> is useless for the scheduler purposes. And if you don't even initialize
> them, then in for_each_numa_hop_node() you should skip those nodes.

Don't we need to worry about nodes going online/offline and dealing with
mem_hotplug_lock as mentioned by Tejun?

Maybe it's safer to initialize all of them here and visit the N_CPU nodes
when looking for idle CPUs? Or am I missing something?

Or maybe we could even initialize and navigate the N_CPU nodes and trigger
a scheduler restart on hotplug events if SCX_OPS_BUILTIN_IDLE_PER_NODE is
enabled...

>
> > + const struct cpumask *node_mask = cpumask_of_node(node);
> > + struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> > + struct cpumask *idle_smt = get_idle_smtmask_node(node);
> > +
> > + cpumask_and(idle_cpu, cpu_online_mask, node_mask);
> > + cpumask_copy(idle_smt, idle_cpu);
> > + }
> > }
> >
> > void __scx_update_idle(struct rq *rq, bool idle)
> > {
> > int cpu = cpu_of(rq);
> > + int node = cpu_to_node(cpu);
> > + struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> >
> > if (SCX_HAS_OP(update_idle) && !scx_rq_bypassing(rq)) {
> > SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
> > @@ -3661,27 +3810,25 @@ void __scx_update_idle(struct rq *rq, bool idle)
> > return;
> > }
> >
> > - if (idle)
> > - cpumask_set_cpu(cpu, idle_masks.cpu);
> > - else
> > - cpumask_clear_cpu(cpu, idle_masks.cpu);
> > + assign_cpu(cpu, idle_cpu, idle);
>
> Can you also make it a separate preparation patch?

Yep, done in the next version.

>
> >
> > #ifdef CONFIG_SCHED_SMT
> > if (sched_smt_active()) {
> > const struct cpumask *smt = cpu_smt_mask(cpu);
> > + struct cpumask *idle_smt = get_idle_smtmask_node(node);
> >
> > if (idle) {
> > /*
> > - * idle_masks.smt handling is racy but that's fine as
> > - * it's only for optimization and self-correcting.
> > + * idle_smt handling is racy but that's fine as it's
> > + * only for optimization and self-correcting.
> > */
> > for_each_cpu(cpu, smt) {
> > - if (!cpumask_test_cpu(cpu, idle_masks.cpu))
> > + if (!cpumask_test_cpu(cpu, idle_cpu))
> > return;
> > }
>
> So, we continue only if all smt cpus are idle. This looks like an
> opencoded cpumask_subset():
>
> if (!cpumask_subset(idle_cpu, smt))
> return;
>
> Is that true? If so, can you please again make it a small
> preparation patch?

Good point, if all the CPUs in smt are all idle (== smt is a subset of
idle_cpu), we want to set all the smt bits in idle_smt. So I think the
following should be correct:

if (!cpumask_subset(smt, idle_cpu))
return;
cpumask_or(idle_smt, idle_smt, smt);

Will test this & make a preparation patch.

Thanks!
-Andrea