RE: [PATCH] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks

From: Achkinazi, Igor

Date: Wed May 20 2026 - 18:53:08 EST

Internal Use - Confidential
On Mon, May 18, 2026 at 03:52:09PM -0600, Keith Busch wrote:
> On Mon, May 18, 2026 at 08:59:09PM +0000, Achkinazi, Igor wrote:
> > - submit_bio_noacct_nocheck is block-internal and not exported, so using it
> > from NVMe would require a block-layer API change just for this.
>
> That is not a valid reason to not do this. We often export and unexport
> APIs as needs evolve. I'm not saying you have to use this API, but I'm
> just not yet seeing why we shouldn't.

Fair point

> > - it bypasses more checks than I see we need here (throttling, RO, crypto,
> > op-type), I prefer bypassing only the EOD check.
>
> I do not think we need any of those on the hidden device either. The
> stacked limits should have caught any problems or handled any policy on
> the first pass.

Agreed, however I want to point out that EOD might be special since the
capacity can change asynchronously between passes because set_capacity(0)
in nvme_ns_remove() runs before synchronize_srcu(), not after.

> > - BIO_REMAPPED propagates to split clones, so it covers all resubmissions,
> > not just the initial one.
>
> Submitting split clones already uses submit_bio_noacct_nocheck() so the
> BIO_REMAPPED flag doesn't come into play there either.

You are right, on current mainline (since commit 0b64682e78f7 "block:
skip unnecessary checks for split bio") split remainders go through
submit_bio_noacct_nocheck(). The split path was the trigger on the
older production kernels where we observed the failures (5.14, 6.4),
where splits still went through submit_bio_noacct().

Looking at this more carefully, the race still exists on current mainline
through the initial submit_bio_noacct() call inside
nvme_ns_head_submit_bio() itself, not in splits:

nvme_find_path(head)
-- returns ns, NVME_NS_READY was set
-- nvme_ns_remove() races in here:
-- clear_bit(NVME_NS_READY)
-- set_capacity(ns->disk, 0)
-- synchronize_srcu() blocks (we hold it)
bio_set_dev(bio, ns->disk->part0)
-- clears BIO_REMAPPED
submit_bio_noacct(bio)
-- bio_check_eod() sees capacity=0, fails

The SRCU read lock prevents synchronize_srcu() from completing, but it
does not prevent set_capacity(0) from executing. So the bio can fail
the EOD check on the per-path device while we're still inside the SRCU
read-side critical section.

> And BIO_REMAPPED applies to when the bio's bd_part is a partition. The
> multipath layer overrides the block device with the part0 of the path,
> so there is no partition (and we may not have been dealing with a
> partition in first place), so this patch is introducing a new implicit
> expectation on what this flag means.

I agree this sounds like overloading of the BIO_REMAPPED flag that was
introduced to skip partition remaps. However skipping bio_check_eod
and blk_partition_remap is what is needed: the per-path device is always
a whole disk so skipping blk_partition_remap is a no-op, and skipping
bio_check_eod is intentional to avoid the false IO error. This is the
same pattern as commit 3a905c37c351 ("block: skip bio_check_eod for
partition-remapped bios") which used BIO_REMAPPED to skip EOD on
resubmission after remapping.

I did not want to add a new block-layer flag for a single case that
needs the exact same behavior the existing flag provides.

> Consider a real failover going through the requeue_list using the
> submit_bio_noacct() again. If you've already set BIO_REMAPPED, then it
> skips the checks, but that could have happened across a controller
> format change that modified all the limits, so you probably want the eod
> checks to happen again on this scenario. Your suggestion would skip it
> because you're using BIO_REMAPPED as a proxy to skip "eod" checks.

I see that the failover path clears BIO_REMAPPED before the bio reaches
the requeue list. nvme_failover_req() calls bio_set_dev() on each bio
to redirect it back to the multipath head:

void nvme_failover_req(struct request *req)
{
...
spin_lock_irqsave(&ns->head->requeue_lock, flags);
for (bio = req->bio; bio; bio = bio->bi_next)
bio_set_dev(bio, ns->head->disk->part0); /* clears BIO_REMAPPED */
blk_steal_bios(&ns->head->requeue_list, req);
spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
...
}

Then nvme_requeue_work() calls submit_bio_noacct(bio) with BIO_REMAPPED
already cleared, so bio_check_eod() runs normally on the multipath head.
If a controller format changed the capacity in between, the EOD check on
the multipath head catches it.

I see from your "validate bios against queue limits" discussion that
Christoph Hellwig suggested eliminating the unconditional set_capacity(0)
entirely and you are exploring whether the GD_DEAD checks are sufficient
to replace it. That might be the root cause fix.

You also noted in the mentioned thread that BIO_REMAPPED does not cover
bio_queue_enter() -> GD_DEAD -> bio_io_error() path. That is true, but
I think it is a separate race. The error we saw in our tests is
specifically bio_check_eod():

"attempt to access beyond end of device"
"nvme1c9n1: rw=33556480, sector=476160, nr_sectors=256 limit=0"

BIO_REMAPPED addresses this reported failure. If removing
set_capacity(0) from nvme_ns_remove() goes in as part of your RFC
series, it fixes both races and this patch is not needed. Until then,
this patch provides a minimal fix for the bio_check_eod() case that is
backportable to stable kernels affected today.

I will send a v2 with an updated commit message that clarifies the
primary race and drops the arguments you rightly pointed out aren’t
valid.

Thanks,
Igor