Re: [PATCH v10 13/13] docs: add io_queue flag to isolcpus
From: Aaron Tomlin
Date: Sun Apr 12 2026 - 18:50:55 EST
On Sat, Apr 11, 2026 at 08:52:00PM +0800, Ming Lei wrote:
> > The critical issue lies at the invocation of group_cpus_evenly(). Without
> > this patchset, the core logic lacks the necessary constraints to respect
> > CPU isolation. It is entirely possible, and indeed happens in practice, for
> > an isolated CPU to be assigned to a CPU mask group.
>
> It is one bug report? No, because it doesn't show any trouble from user
> viewpoint.
Hi Ming,
The lack of a formal bug report does not negate the fact that the current
behaviour silently breaks the fundamental contract of CPU isolation from
the administrator's perspective.
To illustrate the user-visible impact, the following demonstrates the
difference between relying on isolcpus=managed_irq and isolcpus=io_queue
under 7.0.0-rc3-00065-gd80965e205a5, which includes this series.
With isolcpus=managed_irq, the Broadcom MPI3 Storage Controller driver
still allocates a full complement of 48 operational queue pairs.
Consequently, the corresponding MSI-X vectors are generated and mapped
directly onto the isolated cores, thereby breaching isolation.
# uname -r
7.0.0-rc3-00065-gd80965e205a5
# tr ' ' '\n' < /proc/cmdline | grep isolcpus=
isolcpus=managed_irq,domain,2-47
# cat /sys/devices/system/cpu/isolated
2-47
# dmesg | grep -A 6 'MSI-X vectors supported:'
[ 2.981705] mpi3mr0: MSI-X vectors supported: 128, no of cores: 48,
[ 2.981705] mpi3mr0: MSI-X vectors requested: 49 poll_queues 0
[ 3.001915] mpi3mr0: trying to create 48 operational queue pairs
[ 3.011214] mpi3mr0: allocating operational queues through segmented queues
[ 3.101903] mpi3mr0: successfully created 48 operational queue pairs(default/polled) queue = (2/0)
[ 3.111468] mpi3mr0: controller initialization completed successfully
# awk '/mpi3mr0/ { print $1" "$NF }' /proc/interrupts
78: mpi3mr0-msix0
79: mpi3mr0-msix1
80: mpi3mr0-msix2
81: mpi3mr0-msix3
82: mpi3mr0-msix4
83: mpi3mr0-msix5
84: mpi3mr0-msix6
85: mpi3mr0-msix7
86: mpi3mr0-msix8
87: mpi3mr0-msix9
88: mpi3mr0-msix10
89: mpi3mr0-msix11
90: mpi3mr0-msix12
...
122: mpi3mr0-msix44
123: mpi3mr0-msix45
124: mpi3mr0-msix46
125: mpi3mr0-msix47
126: mpi3mr0-msix48
# grep -H '' /proc/irq/{119,120,121,122}/{effective,smp}_affinity_list
/proc/irq/119/effective_affinity_list:42
/proc/irq/119/smp_affinity_list:42
/proc/irq/120/effective_affinity_list:43
/proc/irq/120/smp_affinity_list:43
/proc/irq/121/effective_affinity_list:44
/proc/irq/121/smp_affinity_list:44
/proc/irq/122/effective_affinity_list:45
/proc/irq/122/smp_affinity_list:45
Now, with isolcpus=io_queue,domain,2-47, the allocation is structurally
restricted at the source. The driver creates only two operational queue
pairs, confining all resulting interrupts exclusively to the housekeeping
CPUs (0 and 1):
# uname -r
7.0.0-rc3-00065-gd80965e205a5
# tr ' ' '\n' < /proc/cmdline | grep isolcpus=
isolcpus=io_queue,domain,2-47
# cat /sys/devices/system/cpu/isolated
2-47
# dmesg | grep -A 6 'MSI-X vectors supported:'
[ 3.284850] mpi3mr0: MSI-X vectors supported: 128, no of cores: 48,
[ 3.284851] mpi3mr0: MSI-X vectors requested: 49 poll_queues 0
[ 3.305492] mpi3mr0: allocated vectors (3) are less than configured (49)
[ 3.316528] mpi3mr0: trying to create 2 operational queue pairs
[ 3.328013] mpi3mr0: allocating operational queues through segmented queues
[ 3.340697] mpi3mr0: successfully created 2 operational queue pairs(default/polled) queue = (2/0)
[ 3.350664] mpi3mr0: controller initialization completed successfully
# awk '/mpi3mr0/ { print $1" "$NF }' /proc/interrupts
79: mpi3mr0-msix0
80: mpi3mr0-msix1
81: mpi3mr0-msix2
# grep -H '' /proc/irq/{79,80,81}/{effective,smp}_affinity_list
/proc/irq/79/effective_affinity_list:1
/proc/irq/79/smp_affinity_list:1
/proc/irq/80/effective_affinity_list:1
/proc/irq/80/smp_affinity_list:1
/proc/irq/81/effective_affinity_list:0
/proc/irq/81/smp_affinity_list:0
> Sebastian explains/shows how "isolcpus=managed_irq" works perfectly in the
> following link:
>
> https://lore.kernel.org/all/20260401110232.ET5RxZfl@xxxxxxxxxxxxx/
>
> You have reviewed it...
>
> What matters is that IO won't interrupt isolated CPU.
isolcpus=managed_irq acts as a "best effort" avoidance mechanism rather
than a strict, unbreakable constraint. This is acknowledged in the
proposed changes to Documentation/core-api/irq/managed_irq.rst [1].
[1]: https://lore.kernel.org/all/20260401110232.ET5RxZfl@xxxxxxxxxxxxx/
The following is an excerpt from irq_do_set_affinity().

- File: kernel/irq/manage.c

int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask, bool force)
{
        struct cpumask *tmp_mask = this_cpu_ptr(&__tmp_mask);
        ...
        if (irqd_affinity_is_managed(data) &&
            housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
                const struct cpumask *hk_mask;

                hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);

                cpumask_and(tmp_mask, mask, hk_mask);
                if (!cpumask_intersects(tmp_mask, cpu_online_mask))
                        prog_mask = mask;
                else
                        prog_mask = tmp_mask;
        } else {
                prog_mask = mask;
        }
1. If the requested mask consists only of isolated CPUs (e.g., 2-47),
   it has zero intersection with hk_mask (which contains only the
   housekeeping CPUs), so the resulting tmp_mask is completely empty.
2. Because tmp_mask is empty, it cannot intersect cpu_online_mask
   either.
3. The kernel therefore takes the fallback path: it abandons the empty,
   filtered tmp_mask and falls back to the originally requested mask,
   which contains only isolated CPUs. The interrupt is thus routed
   directly to an isolated CPU, proving that managed_irq cannot
   guarantee isolation.
> > The newer implementation of irq_create_affinity_masks() introduced by this
> > series resolves this. It considers the new CPU mask added to the IRQ
> > affinity descriptor. When group_mask_cpus_evenly() is called, this mask is
> > evaluated [1], guaranteeing that isolated CPUs are entirely excluded from
> > the mask groups.
> >
> > [1]: https://lore.kernel.org/lkml/20260401222312.772334-8-atomlin@xxxxxxxxxxx/
>
> Not at all.
>
> isolated CPU is still included in each group's cpu mask, please see patch
> 9:
You are entirely correct. The actual structural exclusion that prevents
the interrupts from landing on those cores occurs later, via
irq_spread_hk_filter() in irq_create_affinity_masks(), as per patch 12 [2].
[2]: https://lore.kernel.org/lkml/20260401222312.772334-13-atomlin@xxxxxxxxxxx/
Kind regards,
--
Aaron Tomlin