[PATCH v2] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
From: Achkinazi, Igor
Date: Thu May 28 2026 - 13:07:22 EST
When nvme_ns_head_submit_bio() remaps a bio from the multipath head to
a per-path namespace, bio_set_dev() clears BIO_REMAPPED. The remapped
bio is then resubmitted through submit_bio_noacct(), which calls
bio_check_eod() because BIO_REMAPPED is not set.
This races with nvme_ns_remove(), which zeroes the per-path capacity
before synchronize_srcu():

  CPU 0 (IO submission)
  ---------------------
  srcu_read_lock()
  nvme_find_path() -> ns
    [NVME_NS_READY is set]

  CPU 1 (namespace removal)
  -------------------------
  clear_bit(NVME_NS_READY)
  set_capacity(ns->disk, 0)
  synchronize_srcu() <- blocks

  CPU 0 (IO submission)
  ---------------------
  bio_set_dev(bio, ns->disk->part0)
    [clears BIO_REMAPPED]
  submit_bio_noacct(bio)
    -> bio_check_eod() sees capacity=0
    -> bio fails with IO error

The SRCU read lock prevents synchronize_srcu() from completing, but
does not prevent set_capacity(0) from executing. The bio fails the
EOD check before it reaches the NVMe driver, so nvme_failover_req()
never gets a chance to redirect it to another path. I/O errors are
reported to the application even though another path is available.
On older kernels (before commit 0b64682e78f7 "block: skip unnecessary
checks for split bio"), the same race was also reachable through split
remainders resubmitted via submit_bio_noacct().
Observed during NVMe multipath failover testing at Dell on
5.14.0-570.23.1.el9_6.x86_64 (RHEL 9.7) and
6.4.0-150600.23.53-default (SLES 15.6).
Fix this by setting BIO_REMAPPED after bio_set_dev() in
nvme_ns_head_submit_bio(). This skips bio_check_eod() on the per-path
device; the EOD check already passed on the multipath head.
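
For illustration only, the check being skipped can be modeled in
userspace C. This is a simplified sketch, not kernel code: every
model_* name is hypothetical, the structs are stand-ins for struct bio
and struct block_device, and the racing set_capacity(0) is modeled as a
plain assignment rather than a real concurrent writer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins; not the real kernel definitions. */
struct model_bdev {
	uint64_t nr_sectors;	/* device capacity in sectors */
};

struct model_bio {
	struct model_bdev *bdev;
	uint64_t sector;	/* start sector */
	uint32_t nr_sectors;	/* length in sectors */
	bool remapped;		/* stands in for BIO_REMAPPED */
};

/* Modeled on bio_set_dev(): retargeting the bio clears the flag. */
static void model_bio_set_dev(struct model_bio *bio, struct model_bdev *bdev)
{
	bio->remapped = false;
	bio->bdev = bdev;
}

/* Modeled on bio_check_eod(): nonzero if the bio runs past the device end. */
static int model_check_eod(const struct model_bio *bio)
{
	uint64_t max = bio->bdev->nr_sectors;

	if (bio->nr_sectors &&
	    (bio->nr_sectors > max || bio->sector > max - bio->nr_sectors))
		return -1;
	return 0;
}

/* Modeled on the gate in submit_bio_noacct(): 0 = reaches the driver. */
static int model_submit(const struct model_bio *bio)
{
	if (!bio->remapped && model_check_eod(bio))
		return -1;
	return 0;
}

/*
 * Model of the head remapping a bio to a path whose capacity has
 * already been zeroed by namespace removal.
 */
static int model_head_submit(struct model_bio *bio, struct model_bdev *path,
			     bool with_fix)
{
	path->nr_sectors = 0;		/* effect of set_capacity(ns->disk, 0) */
	model_bio_set_dev(bio, path);	/* clears the flag */
	if (with_fix)
		bio->remapped = true;	/* the one-line change in this patch */
	return model_submit(bio);
}
```

In this model the submission fails without the fix and reaches the
driver with it, where the real driver would then be able to fail the
request over to another path.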
NVMe per-path namespace devices are always whole disks (bd_partno=0),
so the blk_partition_remap() skip also gated by BIO_REMAPPED is a
no-op. The flag does not persist across failover and cannot go stale
if the namespace geometry changes between attempts: nvme_failover_req()
calls bio_set_dev() to redirect the bio back to the multipath head,
which clears BIO_REMAPPED. When nvme_requeue_work() resubmits through
submit_bio_noacct(), bio_check_eod() runs normally against the current
capacity.
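
The flag lifecycle across a failover can be sketched the same way. This
is an assumption-laden userspace model: the model_* names and the bit
position of MODEL_BIO_REMAPPED are hypothetical, and the functions only
mirror the bio_set_dev() calls described above, not the full kernel
paths.

```c
#include <stdbool.h>

#define MODEL_BIO_REMAPPED (1u << 0)	/* hypothetical bit position */

enum model_target { MODEL_HEAD, MODEL_PATH };

struct model_bio {
	unsigned int flags;
	enum model_target target;
};

/* Modeled on bio_set_dev(): always clears the remapped flag. */
static void model_bio_set_dev(struct model_bio *bio, enum model_target t)
{
	bio->flags &= ~MODEL_BIO_REMAPPED;
	bio->target = t;
}

/* Head submission with this patch: remap to the path, then set the flag. */
static void model_submit_to_path(struct model_bio *bio)
{
	model_bio_set_dev(bio, MODEL_PATH);
	bio->flags |= MODEL_BIO_REMAPPED;
}

/* Failover: redirect the bio back to the head, which clears the flag. */
static void model_failover(struct model_bio *bio)
{
	model_bio_set_dev(bio, MODEL_HEAD);
}

/* Would the EOD check run on the next submission of this bio? */
static bool model_eod_check_runs(const struct model_bio *bio)
{
	return !(bio->flags & MODEL_BIO_REMAPPED);
}
```

After model_submit_to_path() the check is skipped; after
model_failover() it runs again, against whatever capacity the head
reports at that time.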
Same approach as commit 3a905c37c351 ("block: skip bio_check_eod for
partition-remapped bios").
A broader solution that moves bio validation into the queue-entered
context and eliminates the set_capacity(0) hack is being developed
upstream. This minimal fix is nonetheless suitable for backporting to
the stable kernels affected today. The upstream series is at:
https://lore.kernel.org/linux-block/20260519172326.3462354-1-kbusch@xxxxxxxx/
Fixes: a7c7f7b2b641 ("nvme: use bio_set_dev to assign ->bi_bdev")
Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Igor Achkinazi <igor.achkinazi@xxxxxxxx>
---
v2:
 - Corrected race description: primary race is in the initial
   submit_bio_noacct() call in nvme_ns_head_submit_bio(), not
   only in split remainders (which are no longer affected on
   current mainline since commit 0b64682e78f7)
 - Dropped incorrect arguments about submit_bio_noacct_nocheck
   export status and BIO_REMAPPED propagation to split clones
 - Added analysis showing BIO_REMAPPED flag does not persist
   across failover (nvme_failover_req clears it via bio_set_dev)
 - Referenced upstream RFC series addressing the root cause

drivers/nvme/host/multipath.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 263161cb8ac0..04f7c7e59945 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -511,6 +511,13 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
 	ns = nvme_find_path(head);
 	if (likely(ns)) {
 		bio_set_dev(bio, ns->disk->part0);
+		/*
+		 * Skip bio_check_eod() when this bio enters
+		 * submit_bio_noacct() for the per-path device.
+		 * The EOD check already passed on the multipath head.
+		 */
+		bio_set_flag(bio, BIO_REMAPPED);
 		bio->bi_opf |= REQ_NVME_MPATH;
 		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
 				      bio->bi_iter.bi_sector);
--
2.43.0