Re: [PATCH] lib/group_cpus: make group CPU cluster aware

From: Guo, Wangyang

Date: Mon Jan 12 2026 - 20:59:43 EST


On 1/10/2026 10:24 AM, Guo, Wangyang wrote:
On 1/10/2026 3:13 AM, Radu Rendec wrote:
Hi all,

On Mon, 2025-12-22 at 11:03 +0800, Guo, Wangyang wrote:
On 12/22/2025 3:10 AM, Andrew Morton wrote:
On Fri, 24 Oct 2025 10:30:38 +0800 Wangyang Guo <wangyang.guo@xxxxxxxxx> wrote:

As CPU core counts increase, the number of NVMe IRQs may be smaller than
the total number of CPUs. This forces multiple CPUs to share the same
IRQ. If the IRQ affinity and the CPU’s cluster do not align, a
performance penalty can be observed on some platforms.

It would be helpful to quantify "performance penalty".  At least give
readers some approximate understanding of how serious this issue is,
please.

Thanks for your reminder, will update the changelog in the next version. We see
a 15%+ performance difference with FIO libaio/randread/bs=8k.

This patch improves IRQ affinity by grouping CPUs by cluster within each
NUMA domain, ensuring better locality between CPUs and their assigned
NVMe IRQs.

Reviewed-by: Tianyou Li <tianyou.li@xxxxxxxxx>
Reviewed-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
Tested-by: Dan Liang <dan.liang@xxxxxxxxx>
Signed-off-by: Wangyang Guo <wangyang.guo@xxxxxxxxx>

Patch hasn't attracted additional review so I'll queue this version for
some testing in mm.git's mm-nonmm-unstable branch.  I'll add a
note-to-self that a changelog addition is desirable.


Thanks a lot for your time and support! Please let me know if you have
any further comments or guidance. Any feedback would be appreciated.

With this patch applied, I see a weird issue in a qemu x86_64 vm if I
start it with a higher number of max CPUs than active CPUs, for example
`-smp 4,maxcpus=8` on the qemu command line.

What I see is the `while (1)` loop in alloc_cluster_groups() spinning
forever. Removing the `maxcpus=8` from the qemu command line fixes the
issue but so does reverting the patch :)

Thanks for the report. I will investigate this problem.
The problem happens in this loop:

/* Probe how many clusters in this node. */
while (1) {
	cpu = cpumask_first(msk);
	if (cpu >= nr_cpu_ids)
		break;

	cluster_mask = topology_cluster_cpumask(cpu);
	/* Clean out CPUs on the same cluster. */
	cpumask_andnot(msk, msk, cluster_mask);
	ncluster++;
}

In this case, topology_cluster_cpumask(cpu) returns an empty cluster_mask for a possible-but-offline CPU, so the subsequent cpumask_andnot() never clears any bits from msk and the loop spins forever.

It can be fixed by checking returned cluster_mask:

  cluster_mask = topology_cluster_cpumask(cpu);
+ if (!cpumask_weight(cluster_mask))
+	goto no_cluster;
  /* Clean out CPUs on the same cluster. */
  cpumask_andnot(msk, msk, cluster_mask);
  ncluster++;

BR
Wangyang