Re: [PATCH v9 09/13] isolation: Introduce io_queue isolcpus type
From: Aaron Tomlin
Date: Thu Apr 02 2026 - 20:56:01 EST
On Thu, Apr 02, 2026 at 11:09:40AM +0200, Sebastian Andrzej Siewior wrote:
> On 2026-04-01 16:58:22 [-0400], Aaron Tomlin wrote:
> > Hi Sebastian,
> Hi,
>
> > Thank you for taking the time to document the "managed_irq" behaviour; it
> > is immensely helpful. You raise a highly pertinent point regarding the
> > potential proliferation of "isolcpus=" flags. It is certainly a situation
> > that must be managed carefully to prevent every subsystem from demanding
> > its own bit.
> >
> > To clarify the reasoning behind introducing "io_queue" rather than strictly
> > relying on managed_irq:
> >
> > The managed_irq flag belongs firmly to the interrupt subsystem. It dictates
> > whether a CPU is eligible to receive hardware interrupts whose affinity is
> > managed by the kernel. Whilst many modern block drivers use managed IRQs,
> > the block layer multi-queue mapping encompasses far more than just
> > interrupt routing. It maps logical queues to CPUs to handle I/O submission,
> > software queues, and crucially, poll queues, which do not utilise
> > interrupts at all. Furthermore, there are specific drivers that do not use
> > the managed IRQ infrastructure but still rely on the block layer for queue
> > distribution.
>
> Could you tell block which queue maps to which CPU at /sys/block/$$/mq/
> level? Then you have one queue going to one CPU.
> Then the driver could request one or more interrupts, managed or not. For
> managed you could specify a CPU mask which you desire to occupy.
> You have the cases where:
> - there are more queues than CPUs
>   - use all of them
>   - use fewer
> - there are fewer queues than CPUs
> - a queue is mapped to more than one CPU, in case one goes down or
>   becomes unavailable
> - a queue is mapped to exactly one CPU
>
> Ideally you solve this at one level so that the device(s) can request
> fewer queues than CPUs, if told to, without patching each and every driver.
>
> This should give you the freedom to isolate CPUs, decide at boot time
> which CPUs get I/O queues assigned. At run time you can tell which
> queues go to which CPUs. If you shutdown a queue, the interrupt remains
> but does not get any I/O requests assigned so no problem. If the CPU
> goes down, same thing.
>
> I am trying to come up with a design here which I haven't found so far.
> But I might be late to the party and everyone else is fully aware.
>
> > If managed_irq were solely relied upon, the IRQ subsystem would
> > successfully keep hardware interrupts off the isolated CPUs, but the block
>
> The managed_irqs can't be influenced by userland. The CPUs are
> auto-distributed.
>
> > layer would still blindly map polling queues or non-managed queues to those
> > same isolated CPUs. This would force isolated CPUs to process I/O
> > submissions or handle polling tasks, thereby breaking the strict isolation.
> >
> > Regarding the point about the networking subsystem, it is a very valid
> > comparison. If the networking layer wishes to respect isolcpus in the
> > future, adding a net flag would indeed exacerbate the bit proliferation.
>
> Networking could also have different cases, like adding an RX filter and
> having HW put packets matching it into a dedicated queue. But also in
> this case I would like to have the freedom to decide which isolated
> CPUs should receive interrupts/traffic and which should not.
>
> > For the present time, retaining io_queue seems the most prudent approach to
> > ensure that block queue mapping remains semantically distinct from
> > interrupt delivery. This provides an immediate and clean architectural
> > boundary. However, if the consensus amongst the maintainers suggests that
> > this is too granular, alternative approaches could certainly be considered
> > for the future. For instance, a broader, more generic flag could be
> > introduced to encompass both block and future networking queue mappings.
> > Alternatively, if semantic conflation is deemed acceptable, the existing
> > managed_irq housekeeping mask could simply be overloaded within the block
> > layer to restrict all queue mappings.
> >
> > Keeping the current separation appears to be the cleanest solution for this
> > series, but your thoughts, and those of the wider community, on potentially
> > migrating to a consolidated generic flag in the future would be very much
> > welcomed.
>
> I just don't like introducing yet another boot argument, making it a
> boot-time constraint, while in my naive view this could be managed to
> some degree via sysfs as suggested above.
Hi Sebastian,
I believe it would be more prudent to defer to Thomas Gleixner and Jens
Axboe on this matter.
Indeed, I am entirely sympathetic to your reluctance to introduce yet
another boot parameter, and I concur that run-time configurability
represents the ideal scenario for system tuning.
At present, a device such as an NVMe controller allocates its hardware
queues and requests its interrupt vectors during the initial device probe
phase. The block layer calculates the optimal queue to CPU mapping based on
the system topology at that precise moment. Altering this mapping
dynamically at runtime via sysfs would be an exceptionally intricate
undertaking. It would necessitate freezing all active operations, tearing
down the physical hardware queues on the device, renegotiating the
interrupt vectors with the PCI subsystem, and finally reconstructing the
entire queue map.
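To illustrate the scale of that undertaking, the runtime path would roughly
amount to the outline below. This is a hedged sketch for orientation only,
not a proposed implementation: the helper names (blk_mq_freeze_queue,
pci_free_irq_vectors, pci_alloc_irq_vectors_affinity,
blk_mq_update_nr_hw_queues) are existing kernel/PCI APIs, but the actual
sequencing, error handling and locking are considerably more involved:

```
/*
 * Rough outline only -- not a working patch. The driver-specific
 * teardown/re-creation steps are elided entirely.
 */
blk_mq_freeze_queue(q);                    /* quiesce all in-flight I/O   */
/* ... tear down the device's hardware queues (driver-specific) ...      */
pci_free_irq_vectors(pdev);                /* release current vectors     */
pci_alloc_irq_vectors_affinity(pdev, ...); /* renegotiate with a new mask */
/* ... re-create hardware queues on the device ...                       */
blk_mq_update_nr_hw_queues(set, nr);       /* rebuild the queue map       */
blk_mq_unfreeze_queue(q);
```

Each of these steps can fail or block, which is why none of this is done
casually after probe today.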
Furthermore, the proposed io_queue boot parameter successfully achieves the
objective of avoiding driver-level modifications. By applying the
housekeeping mask constraint centrally within the core block layer mapping
helpers, all multi-queue drivers automatically inherit the CPU isolation
boundaries without requiring a single line of code to be changed within the
individual drivers themselves.
Because the hardware queue count and CPU alignment must be calculated as
the device initialises, a reliable mechanism is required to inform the
block layer of which CPUs are strictly isolated before the probe sequence
commences. This is precisely why integrating with the existing boot time
housekeeping infrastructure is currently the most viable and robust
solution.
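Concretely, with this series applied, an administrator would express the
constraint on the kernel command line before any device probes. A
hypothetical example (the CPU list is purely illustrative):

```
isolcpus=io_queue,2-5
```

Because the parameter is parsed at boot, the block layer's mapping helpers
can consult the resulting housekeeping mask from the very first queue
allocation onwards.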
Whilst a fully dynamic, sysfs-driven reconfiguration architecture would be
ideal, it would represent a substantial paradigm shift for the block layer.
For the present time, the io_queue flag resolves the immediate and severe
latency issues experienced by users with isolated CPUs, employing an
established and safe methodology.
This is at least my understanding.
Kind regards,
--
Aaron Tomlin