Re: [PATCH v7] lib: optimize cpumask_local_spread()

From: Dave Hansen
Date: Fri Nov 20 2020 - 12:49:08 EST


On 11/17/20 6:54 PM, Shaokun Zhang wrote:
> From: Yuqi Jin <jinyuqi@xxxxxxxxxx>
>
> In multi-processor and NUMA system, I/O driver will find cpu cores that
> which shall be bound IRQ. When cpu cores in the local numa have been
> used up, it is better to find the node closest to the local numa node
> for performance, instead of choosing any online cpu immediately.
>
> On arm64 or x86 platform that has 2-sockets and 4-NUMA nodes, if the
> network card is located in node2 of socket1, while the number queues
> of network card is greater than the number of cores of node2, when all
> cores of node2 has been bound to the queues, the remaining queues will
> be bound to the cores of node0 which is further than NUMA node3.

That's quite the run-on sentence. :)

> It is
> not friendly for performance or Intel's DDIO (Data Direct I/O Technology)

Could you explain *why* it is not friendly to DDIO specifically? This
patch affects where the interrupt handler runs. But, DDIO is based on
memory locations rather than the location of the interrupt handler.

It would be ideal to make that connection: How does the location of the
interrupt handler impact the memory allocation location?

> when if the user enables SNC (sub-NUMA-clustering).

Again, the role that SNC plays here isn't spelled out. I *believe* it's
because SNC ends up reducing the number of CPUs in each NUMA node. That
makes the existing code run out of CPUs to which to bind to the "local"
node sooner.

> +static int find_nearest_node(int node, bool *used)
> +{
> + int i, min_dist, node_id = -1;
> +
> + /* Choose the first unused node to compare */
> + for (i = 0; i < nr_node_ids; i++) {
> + if (used[i] == false) {
> + min_dist = node_distance(node, i);
> + node_id = i;
> + break;
> + }
> + }
> +
> + /* Compare and return the nearest node */
> + for (i = 0; i < nr_node_ids; i++) {
> + if (node_distance(node, i) < min_dist && used[i] == false) {
> + min_dist = node_distance(node, i);
> + node_id = i;
> + }
> + }
> +
> + return node_id;
> +}
> +
> /**
> * cpumask_local_spread - select the i'th cpu with local numa cpu's first
> * @i: index number
> * @node: local numa_node
> *
> * This function selects an online CPU according to a numa aware policy;
> - * local cpus are returned first, followed by non-local ones, then it
> - * wraps around.
> + * local cpus are returned first, followed by the next one which is the
> + * nearest unused NUMA node based on NUMA distance, then it wraps around.
> *
> * It's not very efficient, but useful for setup.
> */
> unsigned int cpumask_local_spread(unsigned int i, int node)

FWIW, I think 'i' is criminally bad naming. It should be called
nr_cpus_to_skip or something similar.

I also detest the comments that are there today.

Loop through all the online CPUs on the system. Start with the
CPUs on 'node', then fall back to CPUs on NUMA nodes which are
increasingly far away.

Skip the first 'nr_cpus_to_skip' CPUs which are found.

This function is not very efficient, especially for large
'nr_cpus_to_skip' because it loops over the same CPUs on each
call and does not remember its state from previous calls.

> {
> - int cpu, hk_flags;
> + static DEFINE_SPINLOCK(spread_lock);
> + static bool used[MAX_NUMNODES];

I thought I mentioned this last time. How large is this array? How
large would it be if it were a nodemask_t? Would this be less code if
you just dynamically allocated and freed the node mask instead of having
a spinlock and a memset?

> + unsigned long flags;
> + int cpu, hk_flags, j, id;
> const struct cpumask *mask;
>
> hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> @@ -352,20 +379,27 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
> return cpu;
> }
> } else {
> - /* NUMA first. */
> - for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
> - if (i-- == 0)
> - return cpu;
> + spin_lock_irqsave(&spread_lock, flags);
> + memset(used, 0, nr_node_ids * sizeof(bool));
> + /* select node according to the distance from local node */
> + for (j = 0; j < nr_node_ids; j++) {
> + id = find_nearest_node(node, used);
> + if (id < 0)
> + break;

There's presumably an outer loop in a driver which is trying to bind a
bunch of interrupts to a bunch of CPUs. We know there are on the order
of dozens of these interrupts.

for_each_interrupt() // in the driver
for (j=0;j<nr_node_ids;j++) // cpumask_local_spread()
// find_nearest_node():
for (i = 0; i < nr_node_ids; i++) {
for (i = 0; i < nr_node_ids; i++) {

Does this worry anybody else? It thought our upper limits on the number
of NUMA nodes was 1024. Doesn't that make our loop O(N^3) where the
worst case is hundreds of millions of loops?

I don't want to prematurely optimize this, but that seems like something
that might just fall over on bigger systems.

This also seems really wasteful if we have a bunch of memory-only nodes.
Each of those will be found via find_nearest_node(), but then this loop:

> + for_each_cpu_and(cpu, cpumask_of_node(id), mask)
> + if (i-- == 0) {
> + spin_unlock_irqrestore(&spread_lock,
> + flags);
> + return cpu;
> + }
> + used[id] = true;
> }

Will just exit immediately because cpumask_of_node() is empty.

'used', for instance, should start by setting 'true' for all nodes which
are not in N_CPUS.


> + spin_unlock_irqrestore(&spread_lock, flags);
>
> - for_each_cpu(cpu, mask) {
> - /* Skip NUMA nodes, done above. */
> - if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
> - continue;
> -
> + for_each_cpu(cpu, mask)
> if (i-- == 0)
> return cpu;
> - }
> }
> BUG();
> }