Re: [PATCH] lib/group_cpus: rotate extra groups to avoid IRQ stacking

From: Naman Jain

Date: Mon Apr 27 2026 - 04:45:16 EST

On 4/27/2026 2:22 AM, Andrew Morton wrote:

On Tue, 24 Mar 2026 07:53:52 +0000 Naman Jain <namjain@xxxxxxxxxxxxxxxxxxx> wrote:

When multiple devices call group_cpus_evenly() with the same number of
groups, the cluster-aware path in __try_group_cluster_cpus() assigns
extra groups to the same set of clusters every time, producing identical
affinity masks for every caller. CPUs in clusters that receive two
groups (and thus get single-CPU dedicated masks) end up handling
interrupts from ALL devices, creating an IRQ imbalance.

For example, on a 96-CPU / 2-NUMA-node system with 24 clusters of
2 CPUs each and 6 NVMe disks each requesting 62 vectors:
alloc_groups_to_nodes() distributes 31 groups across 24 clusters,
giving 7 clusters 2 groups (single-CPU mask = dedicated) and 17
clusters 1 group (2-CPU mask = shared). Because the assignment is
deterministic, all 6 disks produce the same mapping and the same 14
CPUs each accumulate 6 dedicated IRQs -- roughly twice the interrupt
load of other CPUs -- causing up to 11% per-disk throughput degradation
on IRQ-heavy CPUs.
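
(Illustration only, not part of the patch.) The arithmetic in the example above can be checked with a minimal sketch. It assumes the figures are per NUMA node (62 vectors over 2 nodes gives 31 groups per node) and that equal-sized clusters are split as evenly as possible. split_groups() and struct group_split are hypothetical names; the real alloc_groups_to_nodes() distributes groups in proportion to CPU counts and is more involved.

```c
/* Hypothetical helper: even split of ngroups over equal-sized clusters. */
struct group_split {
	unsigned int base;	/* groups that every cluster receives */
	unsigned int extras;	/* clusters that receive one more group */
};

/* Caller must pass nclusters > 0. */
static struct group_split split_groups(unsigned int ngroups,
				       unsigned int nclusters)
{
	struct group_split s;

	s.base = ngroups / nclusters;
	s.extras = ngroups % nclusters;
	return s;
}
```

For 31 groups and 24 clusters this yields base = 1 and extras = 7, so 7 clusters of 2 CPUs carry 2 groups each (7 * 2 = 14 CPUs with single-CPU masks) and the remaining 17 clusters carry 1 group each.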

Fix this by introducing a per-caller rotation offset via a static
atomic counter. After alloc_groups_to_nodes() determines each
cluster's group count, collect the extras (groups above the per-cluster
minimum), then redistribute them starting from a rotated position with
a stride of ncluster/extras so that successive callers scatter their
extra groups across different clusters. A capacity check
(cpumask_weight_and) ensures no cluster is assigned more groups than it
has CPUs, with a fallback loop for any extras that could not be placed
in the strided pass.
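
(Illustration only, not the patch's code.) A minimal user-space sketch of the redistribution described above, under stated assumptions: groups[] holds the per-cluster group counts after the initial allocation, cap[] holds the CPUs per cluster (standing in for the cpumask_weight_and() check), and offset is supplied by the caller, where the patch would take it from the static atomic counter. The function name and signature are hypothetical.

```c
/*
 * Collect the groups above the per-cluster minimum, then hand them
 * back out starting at 'offset' with a stride of ncluster / extras.
 * Assumes groups[i] <= cap[i] on entry.
 */
static void rotate_extra_groups(unsigned int *groups, const unsigned int *cap,
				unsigned int ncluster, unsigned int offset)
{
	unsigned int min, extras = 0, todo, stride, i, pos, placed;

	if (!ncluster)
		return;

	/* Per-cluster minimum. */
	min = groups[0];
	for (i = 1; i < ncluster; i++)
		if (groups[i] < min)
			min = groups[i];

	/* Collect the extras and reset every cluster to the minimum. */
	for (i = 0; i < ncluster; i++) {
		extras += groups[i] - min;
		groups[i] = min;
	}
	if (!extras)
		return;

	stride = ncluster / extras;
	if (!stride)
		stride = 1;

	/* Strided pass, with a capacity check per cluster. */
	todo = extras;
	for (i = 0; i < extras; i++) {
		pos = (offset + i * stride) % ncluster;
		if (groups[pos] < cap[pos]) {
			groups[pos]++;
			todo--;
		}
	}

	/* Fallback: place any leftovers wherever capacity remains. */
	while (todo) {
		placed = 0;
		for (i = 0; i < ncluster && todo; i++) {
			pos = (offset + i) % ncluster;
			if (groups[pos] < cap[pos]) {
				groups[pos]++;
				todo--;
				placed++;
			}
		}
		if (!placed)
			break;
	}
}
```

With 24 clusters of 2 CPUs and 7 extras, the stride is 3: offset 0 places the extras on clusters 0, 3, ..., 18, while offset 1 places them on clusters 1, 4, ..., 19, so two successive callers no longer share any two-group cluster.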

For systems without cluster topology, the same rotation is applied in
assign_cpus_to_groups() at the per-group level: the modular expression
(v + spread_offset) % nv->ngroups selects which groups receive the
extra CPU, replacing the previous sequential decrement.
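
(Illustration only.) One way the modular selection could work, as a sketch: when ncpus CPUs are divided among ngroups groups, ncpus % ngroups groups need one extra CPU, and the rotated index decides which ones. The comparison against the remainder is an assumption made for illustration; the description above only gives the expression (v + spread_offset) % nv->ngroups. cpus_for_group() is a hypothetical name.

```c
/*
 * Number of CPUs assigned to group v. Caller must pass ngroups > 0.
 * Groups whose rotated index falls below the remainder get one extra CPU.
 */
static unsigned int cpus_for_group(unsigned int v, unsigned int ncpus,
				   unsigned int ngroups,
				   unsigned int spread_offset)
{
	unsigned int base = ncpus / ngroups;
	unsigned int extra = ncpus % ngroups;

	return base + (((v + spread_offset) % ngroups) < extra ? 1 : 0);
}
```

For 10 CPUs and 4 groups, spread_offset = 0 gives the extra CPU to groups 0 and 1, while spread_offset = 1 gives it to groups 3 and 0. The total stays at 10 in both cases.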

Thanks. AI review asked a couple of questions.

https://sashiko.dev/#/patchset/20260324075352.2326972-1-namjain@xxxxxxxxxxxxxxxxxxx

group_cpus.c is difficult. It's tricky code and few people seem to be
familiar with it - I certainly don't feel competent to review changes.

The original author (Ming Lei) is still around, but wasn't cc'ed on
this change. Let me add. (I'm seeing two Ming Lei's - apologies if
they aren't the same person ;))

Thank you, Andrew. The points raised in Sashiko's AI review seem to be
valid with respect to asymmetric topology. I'll address them in v2.

Regards,
Naman