Re: [PATCH] lib/group_cpus: rotate extra groups to avoid IRQ stacking

From: Andrew Morton

Date: Sun Apr 26 2026 - 16:54:55 EST


On Tue, 24 Mar 2026 07:53:52 +0000 Naman Jain <namjain@xxxxxxxxxxxxxxxxxxx> wrote:

> When multiple devices call group_cpus_evenly() with the same number of
> groups, the cluster-aware path in __try_group_cluster_cpus() assigns
> extra groups to the same set of clusters every time, producing identical
> affinity masks for every caller. CPUs in clusters that receive two
> groups (and thus get single-CPU dedicated masks) end up handling
> interrupts from ALL devices, creating an IRQ imbalance.
>
> For example, on a 96-CPU / 2-NUMA-node system with 24 clusters of
> 2 CPUs each and 6 NVMe disks each requesting 62 vectors:
> alloc_groups_to_nodes() distributes 31 groups across 24 clusters,
> giving 7 clusters 2 groups (single-CPU mask = dedicated) and 17
> clusters 1 group (2-CPU mask = shared). Because the assignment is
> deterministic, all 6 disks produce the same mapping and the same 14
> CPUs each accumulate 6 dedicated IRQs -- roughly twice the interrupt
> load of other CPUs -- causing up to 11% per-disk throughput degradation
> on IRQ-heavy CPUs.
>
> Fix this by introducing a per-caller rotation offset via a static
> atomic counter. After alloc_groups_to_nodes() determines each
> cluster's group count, collect the extras (groups above the per-cluster
> minimum), then redistribute them starting from a rotated position with
> a stride of ncluster/extras so that successive callers scatter their
> extra groups across different clusters. A capacity check
> (cpumask_weight_and) ensures no cluster is assigned more groups than it
> has CPUs, with a fallback loop for any extras that could not be placed
> in the strided pass.
>
> For systems without cluster topology, the same rotation is applied in
> assign_cpus_to_groups() at the per-group level: the modular expression
> (v + spread_offset) % nv->ngroups selects which groups receive the
> extra CPU, replacing the previous sequential decrement.
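
For anyone trying to follow the description above without the diff at hand, here is a rough userspace model of the two rotations. It is a sketch, not the patch's code. The names rotate_extras() and groups_with_extra_cpu() are made up for illustration, and so is the assumption that each caller's offset is simply one more than the previous caller's. The "< extra_cpus" test is also only a guess at how the modular expression is used.

```python
def rotate_extras(counts, capacity, offset):
    """Model of the cluster path: re-place the groups above the per-cluster
    minimum, starting at a rotated cluster and stepping by ncluster/extras."""
    n = len(counts)
    base = min(counts)                      # per-cluster minimum
    extras = sum(c - base for c in counts)  # groups above that minimum
    out = [base] * n
    if extras == 0:
        return out
    left = extras
    stride = max(n // extras, 1)
    pos = offset % n
    # Strided pass: skip clusters that already hold one group per CPU.
    for _ in range(extras):
        if out[pos] < capacity[pos]:
            out[pos] += 1
            left -= 1
        pos = (pos + stride) % n
    # Fallback: linear scan for extras the strided pass could not place.
    start = offset % n
    while left:
        placed = False
        for i in range(n):
            idx = (start + i) % n
            if left and out[idx] < capacity[idx]:
                out[idx] += 1
                left -= 1
                placed = True
        if not placed:
            raise ValueError("not enough CPU capacity for all groups")
    return out


def groups_with_extra_cpu(ngroups, extra_cpus, spread_offset):
    """Model of the non-cluster path: which groups take the extra CPU
    (assumed reading of the (v + spread_offset) % ngroups expression)."""
    return [v for v in range(ngroups)
            if (v + spread_offset) % ngroups < extra_cpus]


# One node from the changelog's example: 24 clusters of 2 CPUs, 31 groups,
# so 7 clusters hold 2 groups each. Six callers, offsets 0..5 (assumed).
counts = [2] * 7 + [1] * 17
capacity = [2] * 24
per_caller = [rotate_extras(counts, capacity, off) for off in range(6)]
# For each cluster, count how many callers gave it two groups.
stacked = [sum(r[c] == 2 for r in per_caller) for c in range(24)]
print(max(stacked))
```

In this model no cluster ends up with two groups from more than 2 of the 6 callers, whereas without rotation the same 7 clusters get them from all 6. Whether the real patch spreads things this well depends on how its offset advances, which this sketch only guesses at.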

Thanks.  The AI review asked a couple of questions:

https://sashiko.dev/#/patchset/20260324075352.2326972-1-namjain@xxxxxxxxxxxxxxxxxxx


group_cpus.c is difficult. It's tricky code and few people seem to be
familiar with it - I certainly don't feel competent to review changes.

The original author (Ming Lei) is still around, but wasn't cc'ed on
this change.  Let me add him.  (I'm seeing two Ming Leis - apologies if
they aren't the same person ;))