[PATCH v13 0/8] blk: honor isolcpus configuration

From: Aaron Tomlin

Date: Tue May 12 2026 - 20:55:32 EST


Hi,

I have decided to drive this series forward on behalf of Daniel Wagner, the
original author. The series has been rebased on v7.1-rc2-593-g1d5dcaa3bd65.

This series introduces a new CPU isolation feature, "isolcpus=io_queue",
designed to protect isolated cores from the disruptive hardware interrupts
generated by high-performance multi-queue devices.

When enabled, it changes how the generic IRQ subsystem spreads managed
interrupts and how the block layer (blk-mq) maps hardware queues:

1. Restricted IRQ Affinity: Managed hardware interrupts are strictly
confined to online housekeeping CPUs.

2. Transparent I/O Submission: Applications running on isolated CPUs
can still seamlessly submit I/O requests; however, the resulting
hardware completion interrupts are safely routed to a designated
housekeeping CPU.

3. Topology-Aware Queue Allocation: The generic CPU-to-hardware-queue
mapping logic is extended to distribute hardware contexts evenly
among the available housekeeping CPUs, preventing MSI-X vector
exhaustion while maintaining optimal cache locality where possible.
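
To make points 2 and 3 concrete, here is a userspace toy model, not the
blk-mq code from this series. It assumes at most 64 CPUs held in a uint64_t
mask, and the names sketch_test_cpu() and sketch_map_queues() are
hypothetical. The queue count is capped at the number of housekeeping CPUs,
housekeeping CPUs claim the queues round-robin, and isolated CPUs are then
pointed at queues that already have a housekeeping owner:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper for this sketch: test bit @cpu in a 64-bit CPU mask. */
static int sketch_test_cpu(uint64_t mask, unsigned int cpu)
{
	return (int)((mask >> cpu) & 1);
}

/*
 * Toy model, not the blk-mq algorithm: fill map[cpu] with a queue index.
 * Queues are capped at the number of housekeeping CPUs so that each queue
 * has a housekeeping CPU to take its completion interrupt. Returns the
 * number of queues in use.
 */
static unsigned int sketch_map_queues(uint64_t hk_mask, unsigned int nr_cpus,
				      unsigned int nr_queues,
				      unsigned int *map)
{
	unsigned int cpu, used, next, nr_hk = 0;

	assert(nr_cpus > 0 && nr_cpus <= 64 && nr_queues > 0);

	for (cpu = 0; cpu < nr_cpus; cpu++)
		nr_hk += (unsigned int)sketch_test_cpu(hk_mask, cpu);
	assert(nr_hk > 0);

	used = nr_queues < nr_hk ? nr_queues : nr_hk;

	/* Pass 1: housekeeping CPUs claim the queues round-robin. */
	next = 0;
	for (cpu = 0; cpu < nr_cpus; cpu++)
		if (sketch_test_cpu(hk_mask, cpu))
			map[cpu] = next++ % used;

	/* Pass 2: isolated CPUs submit through queues that are already claimed. */
	next = 0;
	for (cpu = 0; cpu < nr_cpus; cpu++)
		if (!sketch_test_cpu(hk_mask, cpu))
			map[cpu] = next++ % used;

	return used;
}
```

With housekeeping CPUs 0-1, isolated CPUs 2-3 and four requested queues, the
model uses two queues: CPU 2 shares queue 0 with CPU 0, and CPU 3 shares
queue 1 with CPU 1. The real mapping is additionally topology-aware, which
this sketch ignores.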

To prevent I/O stalls, the block layer is additionally hardened to reject
hot-plug requests that attempt to offline a housekeeping CPU if it is the
last remaining CPU actively serving an online isolated core.
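
The offline rule can be modelled in a few lines of userspace C. This is a
sketch under stated assumptions, not the check implemented in the series:
CPUs are bits in a uint64_t, map[] holds one queue index per CPU, and
sketch_cpu_in() and sketch_can_offline() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: is @cpu set in the 64-bit CPU mask? */
static bool sketch_cpu_in(uint64_t mask, unsigned int cpu)
{
	return (mask >> cpu) & 1;
}

/*
 * Toy model: may @cpu go offline? A housekeeping CPU is refused when it is
 * the only online housekeeping CPU on its queue and at least one online
 * isolated CPU still submits through that queue.
 */
static bool sketch_can_offline(const unsigned int *map, unsigned int nr_cpus,
			       uint64_t hk_mask, uint64_t online_mask,
			       unsigned int cpu)
{
	bool other_hk = false, isolated_user = false;
	unsigned int i;

	assert(nr_cpus <= 64 && cpu < nr_cpus);

	/* Offlining an isolated CPU never leaves a queue without an owner. */
	if (!sketch_cpu_in(hk_mask, cpu))
		return true;

	for (i = 0; i < nr_cpus; i++) {
		/* Only other online CPUs that share the queue are relevant. */
		if (i == cpu || map[i] != map[cpu] ||
		    !sketch_cpu_in(online_mask, i))
			continue;
		if (sketch_cpu_in(hk_mask, i))
			other_hk = true;
		else
			isolated_user = true;
	}

	return other_hk || !isolated_user;
}
```

In this model, offlining housekeeping CPU 0 is refused while isolated CPU 2
is online and shares its queue, and becomes legal once CPU 2 has gone
offline or another housekeeping CPU serves the same queue.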

This iteration abandons the complex "top-down" mask plumbing introduced in
v12, which modified struct irq_affinity and expanded block layer APIs, in
favour of centralised, direct isolation querying via
housekeeping_cpumask(HK_TYPE_IO_QUEUE) within the genirq/affinity
subsystem. This architectural simplification successfully decouples core
changes from driver-specific implementations, allowing us to drop the
virtio enablement and API modification patches (v12 patches 4, 5, 7, 8, and
9).

Please let me know your thoughts.


Changes since v12:

- Resolved TOCTOU race conditions against CPU hotplug events in
blk_mq_map_queues() and group_mask_cpus_evenly() by taking lockless
snapshots of the online CPU mask prior to algorithmic evaluation.

- Migrated the active_hctx tracking to a dynamically sized bitmap
(bitmap_zalloc), resolving a critical out-of-bounds memory write that
occurred when hardware queues exceeded the system CPU count.

- Wrapped the disk pointer fetch in blk_mq_hctx_can_offline_hk_cpu() with
READ_ONCE() to prevent a TOCTOU NULL pointer dereference against
concurrent device teardowns.

- Introduced bitmap_empty() checks to prevent the mapping logic from
routing unassigned CPUs into unallocated memory when all mapped CPUs are
offline, safely forcing a fallback mapping instead.

- Implemented a native two-stage distribution logic in
group_mask_cpus_evenly() that first prioritises physically present CPUs
to prevent I/O starvation before distributing remaining vectors to
non-present CPUs for hotplug safety.

- Restricted the maximum number of allocated vectors in
irq_calc_affinity_vectors() to the weight of the housekeeping mask,
preventing drivers from wasting memory on dead hardware queues that
physically cannot be routed.

- Added padding logic using irq_default_affinity for sets where isolation
constraints yield fewer masks than requested vectors, preserving the 1:1
hardware queue mapping sequence for subsequent sets.

- Fixed a logic flaw that prematurely rejected valid offline requests. The
check now iterates over cpu_online_mask and reverse-maps each CPU to
detect isolated CPUs accurately, which properly permits the offlining of
non-housekeeping CPUs.

- Corrected an absolute versus relative queue index calculation bug in
blk_mq_map_queues() that was overwriting loop iterations, by iterating
directly over the generated masks.

- Replaced scoped __free cleanups with traditional goto unwinding in the
block layer to align with subsystem styling guidelines.

- Refined the io_queue kernel command-line parameter documentation for
better clarity and precision.
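
For the active_hctx item above, the point is that the tracker is indexed by
hardware queue, so it has to be sized by the queue count rather than the
CPU count. A minimal userspace sketch follows; sketch_bitmap_zalloc() is
only a stand-in for the kernel's bitmap_zalloc(), and all sketch_* names
are hypothetical:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Userspace stand-in for bitmap_zalloc(): a zeroed bitmap of @nbits bits. */
static unsigned long *sketch_bitmap_zalloc(size_t nbits)
{
	size_t nlongs = (nbits + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;

	return calloc(nlongs ? nlongs : 1, sizeof(unsigned long));
}

/* Set @bit; the assert catches indices beyond the allocated size. */
static void sketch_set_bit(unsigned long *map, size_t nbits, size_t bit)
{
	assert(bit < nbits);
	map[bit / SKETCH_BITS_PER_LONG] |= 1UL << (bit % SKETCH_BITS_PER_LONG);
}

/* Test @bit with the same bounds check. */
static bool sketch_test_bit(const unsigned long *map, size_t nbits, size_t bit)
{
	assert(bit < nbits);
	return (map[bit / SKETCH_BITS_PER_LONG] >> (bit % SKETCH_BITS_PER_LONG)) & 1;
}
```

Allocating with nbits equal to the number of hardware queues keeps every
queue index in range even on a machine with, say, 4 CPUs and 130 queues.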

Changes since v11:

- Removed duplicate paragraph from the commit message in patch 11
(Marco Crivellari)

- Ensure ZERO_SIZE_PTR is not returned by group_mask_cpus_evenly()
(Marco Crivellari)

- Linked to v11: https://lore.kernel.org/lkml/20260416192942.1243421-1-atomlin@xxxxxxxxxxx/

Changes since v10:

- Completely rewrote the isolcpus=io_queue documentation in
Documentation/admin-guide/kernel-parameters.txt to clarify its exclusive
application to managed IRQs, queue allocation limits, vector exhaustion
prevention, and hardware interrupt routing (Ming Lei)

- Fixed a stack frame bloat issue by avoiding the on-stack declaration of
struct cpumask (Waiman Long)

- Linked to v10: https://lore.kernel.org/linux-nvme/20260401222312.772334-1-atomlin@xxxxxxxxxxx/

Changes since v9:

- Fixed a page fault regression encountered when initialising secondary
queue maps (e.g., NVMe poll queues). Restored the qmap->queue_offset to
the mq_map assignment to ensure CPUs are strictly mapped to absolute
hardware indices (Keith Busch)

- Corrected the active_hctx tracker to utilise relative queue indices,
preventing out-of-bounds mask assignments

- Fixed the blk_mq_validate() sanity check to properly evaluate absolute
queue indices against the offset-adjusted loop index

- Corrected typographical errors within block/blk-mq-cpumap.c
(Keith Busch)

- Clarified the commit message regarding the removal of the !SMP fallback
code, explicitly noting that the core scheduler now mandates SMP
unconditionally (Sebastian Andrzej Siewior)

- Added missing "Signed-off-by:" tags to properly record the patch series
chain of custody

- Linked to v9: https://lore.kernel.org/lkml/20260330221047.630206-1-atomlin@xxxxxxxxxxx/
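
The absolute versus relative index distinction behind the v9 fixes can be
shown with a small userspace model. It is a sketch only: struct sketch_qmap
mirrors just the mq_map, nr_queues and queue_offset fields of
struct blk_mq_queue_map, the round-robin assignment is a placeholder for
the real mapping, and the per-queue "active" array is an assumption of
this example.

```c
#include <assert.h>

/* Hypothetical stand-in for the fields of struct blk_mq_queue_map used here. */
struct sketch_qmap {
	unsigned int *mq_map;		/* per-CPU absolute hardware queue index */
	unsigned int nr_queues;		/* number of queues in this map */
	unsigned int queue_offset;	/* first absolute index owned by this map */
};

/*
 * Assign CPUs round-robin. The tracker (@active, nr_queues entries) is
 * indexed with the relative queue number, while mq_map[] stores the
 * absolute number, i.e. queue_offset plus the relative one.
 */
static void sketch_fill_map(struct sketch_qmap *qmap, unsigned int nr_cpus,
			    unsigned char *active)
{
	unsigned int cpu;

	assert(qmap->nr_queues > 0);

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		unsigned int rel = cpu % qmap->nr_queues;

		active[rel] = 1;
		qmap->mq_map[cpu] = qmap->queue_offset + rel;
	}
}
```

For a secondary map with two queues starting at offset 8, four CPUs end up
on absolute queues 8, 9, 8 and 9, while the tracker only ever sees indices
0 and 1.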

Changes since v8:

- Added "Reviewed-by:" tags

- Introduced irq_spread_hk_filter() to safely restrict managed IRQ
affinity to housekeeping CPUs (Thomas Gleixner)

- Removed the unsafe global static variable blk_hk_online_mask from
blk-mq-cpumap.c and blk-mq.c. blk_mq_online_queue_affinity() now returns
a stable pointer, delegating safe intersection to the callers to prevent
concurrent modification races (Thomas Gleixner, Hannes Reinecke)

- Resolved the "BUG: kernel NULL pointer dereference" in
__blk_mq_all_tag_iter() reported by the kernel test robot during
cpuhotplug rcutorture stress testing

- Linked to v8: https://lore.kernel.org/lkml/20250905-isolcpus-io-queues-v8-0-885984c5daca@xxxxxxxxxx/
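
As a rough illustration of restricting spread masks to housekeeping CPUs,
the following userspace sketch intersects each per-vector mask with a
housekeeping mask. It is not the body of irq_spread_hk_filter(): the
64-bit masks, the name sketch_hk_filter() and, in particular, the fallback
to the whole housekeeping mask when the intersection is empty are
assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model: restrict every per-vector CPU mask to housekeeping CPUs.
 * Assumption of this sketch: a mask left without any housekeeping CPU
 * falls back to the whole housekeeping mask so the vector stays routable.
 */
static void sketch_hk_filter(uint64_t *masks, size_t nr_masks, uint64_t hk_mask)
{
	size_t i;

	assert(hk_mask != 0);

	for (i = 0; i < nr_masks; i++) {
		uint64_t filtered = masks[i] & hk_mask;

		masks[i] = filtered ? filtered : hk_mask;
	}
}
```

With housekeeping CPUs 0-1, a vector spread over CPUs 1-2 is narrowed to
CPU 1, and a vector spread only over isolated CPUs 2-3 is redirected to
CPUs 0-1.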

Changes since v7:

- Added commit 524f5eea4bbe ("lib/group_cpus: remove !SMP code")

- Merged the new mapping logic directly into the existing function to
avoid special casing

- Refined the group_mask_cpus_evenly() implementation with the following
updates:

  - Corrected the function name typo (changed group_masks_cpus_evenly to
    group_mask_cpus_evenly)

  - Updated the documentation comment to accurately reflect the function's
    behavior

  - Renamed the cpu_mask argument to mask for consistency

- Added a new patch for aacraid to include the missing number of queues
calculation

- Restricted updates to only affect SCSI drivers that support
PCI_IRQ_AFFINITY and do not utilise nvme-fabrics

- Removed the __free cleanup attribute usage for cpumask_var_t allocations
due to compatibility issues

- Updated the documentation to explicitly highlight the limitations
surrounding CPU offlining

- Collected accumulated Reviewed-by and Acked-by tags

- Linked to v7: https://patch.msgid.link/20250702-isolcpus-io-queues-v7-0-557aa7eacce4@xxxxxxxxxx

Changes since v6:

- Sent out the first part of the series independently:
https://lore.kernel.org/all/20250617-isolcpus-queue-counters-v1-0-13923686b54b@xxxxxxxxxx/

- Added comprehensive kernel command-line documentation

- Added validation logic to ensure the resulting CPU-to-queue mapping is
fully operational

- Rewrote the isolcpus mapping code to properly account for active
hardware contexts (hctx)

- Introduced blk_mq_map_hk_irq_queues(), which uses the mask retrieved
from irq_get_affinity()

- Refactored blk_mq_map_hk_queues() to require the caller to explicitly test
for HK_TYPE_MANAGED_IRQ

- Linked to v6: https://patch.msgid.link/20250424-isolcpus-io-queues-v6-0-9a53a870ca1f@xxxxxxxxxx

Changes since v5:

- Reintroduced the io_queue type for the isolcpus kernel parameter

- Prevented the offlining of a housekeeping CPU if an isolated CPU is
still present, upgrading this behavior from a simple warning to a hard
restriction

- Linked to v5: https://lore.kernel.org/r/20250110-isolcpus-io-queues-v5-0-0e4f118680b0@xxxxxxxxxx

Changes since v4:

- Rebased the series onto the latest for-6.14/block branch

- Updated the documentation regarding the managed_irq parameters

- Reworded the commit message for "blk-mq: issue warning when offlining
hctx with online isolcpus" for better clarity

- Split the input and output parameters in the patch "lib/group_cpus: let
group_cpu_evenly return number of groups"

- Dropped the patch "sched/isolation: document HK_TYPE housekeeping
option"

- Linked to v4: https://lore.kernel.org/r/20241217-isolcpus-io-queues-v4-0-5d355fbb1e14@xxxxxxxxxx

Changes since v3:

- Added the patch "blk-mq: issue warning when offlining hctx with online
isolcpus"

- Fixed the check in group_cpus_evenly(); the condition now properly uses
housekeeping_enabled() instead of cpumask_weight() on the housekeeping
mask, as housekeeping_cpumask() always returns a valid mask

- Dropped the Fixes: tag from "lib/group_cpus.c: honor housekeeping config
when grouping CPUs"

- Fixed an overlong line warning in the patch "scsi: use block layer
helpers to calculate num of queues"

- Dropped the patch "sched/isolation: Add io_queue housekeeping option" in
favor of simply documenting the housekeeping hk_type enum

- Added the patch "lib/group_cpus: let group_cpu_evenly return number of
groups"

- Collected accumulated Reviewed-by and Acked-by tags

- Split the patchset by moving foundational changes into a separate
preparation series:
https://lore.kernel.org/linux-nvme/20241202-refactor-blk-affinity-helpers-v6-0-27211e9c2cd5@xxxxxxxxxx/

- Linked to v3: https://lore.kernel.org/r/20240806-isolcpus-io-queues-v3-0-da0eecfeaf8b@xxxxxxx

Changes since v2:

- Integrated patches from Ming Lei
(https://lore.kernel.org/all/20210709081005.421340-1-ming.lei@xxxxxxxxxx/):
"virtio: add APIs for retrieving vq affinity" and "blk-mq: introduce
blk_mq_dev_map_queues"

- Replaced all instances of blk_mq_pci_map_queues and
blk_mq_virtio_map_queues with the new unified blk_mq_dev_map_queues

- Updated and expanded the helper functions used for calculating the
number of queues

- Added the CPU-to-hctx mapping function specifically to support the
isolcpus=io_queue parameter

- Documented the hk_type enum and the newly introduced isolcpus=io_queue
parameter

- Added the patch "scsi: pm8001: do not overwrite PCI queue mapping"

- Linked to v2: https://lore.kernel.org/r/20240627-isolcpus-io-queues-v2-0-26a32e3c4f75@xxxxxxx

Changes since v1:

- Updated the feature documentation for clarity and completeness

- Split the blk/nvme-pci patch into smaller, logical commits

- Dropped the HK_TYPE_IO_QUEUE housekeeping type in favor of reusing
HK_TYPE_MANAGED_IRQ

- Linked to v1: https://lore.kernel.org/r/20240621-isolcpus-io-queues-v1-0-8b169bf41083@xxxxxxx


Aaron Tomlin (1):
  genirq/affinity: Restrict managed IRQ affinity to housekeeping CPUs

Daniel Wagner (7):
  scsi: aacraid: use block layer helpers to calculate num of queues
  lib/group_cpus: remove dead !SMP code
  lib/group_cpus: Add group_mask_cpus_evenly()
  isolation: Introduce io_queue isolcpus type
  blk-mq: use hk cpus only when isolcpus=io_queue is enabled
  blk-mq: prevent offlining hk CPUs with associated online isolated CPUs
  docs: add io_queue flag to isolcpus

 .../admin-guide/kernel-parameters.txt         |  30 ++-
 block/blk-mq-cpumap.c                         | 224 ++++++++++++++++--
 block/blk-mq.c                                |  56 +++++
 drivers/scsi/aacraid/comminit.c               |   3 +-
 include/linux/group_cpus.h                    |   3 +
 include/linux/sched/isolation.h               |   1 +
 kernel/irq/affinity.c                         |  35 ++-
 kernel/sched/isolation.c                      |   7 +
 lib/group_cpus.c                              | 108 ++++++++-
 9 files changed, 427 insertions(+), 40 deletions(-)


base-commit: 1d5dcaa3bd65f2e8c9baa14a393d3a2dc5db7524
--
2.51.0