[RFC PATCH 0/1] nvme-pci: detect I/O queue depth changes after reset

From: guzebing

Date: Wed May 27 2026 - 03:55:24 EST

From: Guzebing <guzebing@xxxxxxxxxxxxx>

We have hit a case where an NVMe firmware activation made the controller
report a different CAP.MQES value after the next controller reset.
This was seen in production on a Memblaze PBlaze5 510 device:

  model:        P5510DS0384T00
  old firmware: 224005A0, CAP.MQES-derived queue depth 1024
  new firmware: 224005F0, CAP.MQES-derived queue depth 128
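
For illustration, the following is a minimal userspace sketch of how a queue
depth can be derived from CAP.MQES. It assumes the usual layout (MQES in
bits 15:0 of CAP, zero-based) and a driver-side clamp; the function names
and the `limit` parameter are hypothetical stand-ins, not the nvme-pci
code itself.

```c
#include <stdint.h>

/* CAP.MQES occupies bits 15:0 of the 64-bit CAP register (zero-based). */
static inline uint32_t cap_mqes(uint64_t cap)
{
	return (uint32_t)(cap & 0xffff);
}

/*
 * Model of the depth calculation: MQES + 1 entries, clamped by a
 * driver-side limit. `limit` is a hypothetical parameter standing in
 * for a module-level cap such as io_queue_depth.
 */
static uint32_t derived_q_depth(uint64_t cap, uint32_t limit)
{
	uint32_t depth = cap_mqes(cap) + 1;

	return depth < limit ? depth : limit;
}

/* Zero-based submission queue size that corresponds to a depth. */
static uint32_t derived_sqsize(uint32_t q_depth)
{
	return q_depth - 1;
}
```

Under these assumptions, a raw MQES of 127 gives a depth of 128, and any
MQES of 1023 or more gives 1024 when the limit is 1024. The raw MQES
values of the two firmware versions are not stated here, so they are not
filled in.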

One way to hit this path is to activate the new firmware and then reset
the controller:

  nvme fw-download /dev/nvmeX -f fw_bin.tar
  nvme fw-activate /dev/nvmeX -s 2 -a 1
  nvme reset /dev/nvmeX

When the I/O queue depth derived from CAP.MQES became smaller after that
reset, the driver failed to recover any usable I/O queues. In our
production kernel this was logged as:

  nvme nvme0: IO queues lost

The namespaces were then removed, so the corresponding block device
disappeared. The opposite direction is less visible: if the CAP-derived
depth becomes larger, reset can complete without an error and the block
device can remain usable, but the live queue depth state is not updated
consistently.

The reason is that reset updates only part of the live queue depth state.
The nvme-pci reset path disables the controller, re-enables it, re-reads
CAP, and recalculates:

  dev->q_depth
  ctrl->sqsize

from the new CAP.MQES value. Later, however, nvme_create_io_queues()
reuses the existing struct nvme_queue entries. nvme_alloc_queue()
returns immediately when the queue already exists, so the old values
remain in:

  nvmeq->q_depth
  nvmeq->cq_dma_addr
  nvmeq->sq_dma_addr

The Create CQ/SQ commands then request queues with nvmeq->q_depth
entries (encoded in the command as nvmeq->q_depth - 1) rather than the
newly computed dev->q_depth, and they use the old SQ/CQ DMA addresses.
The blk-mq side also keeps the old depth: the reset path updates the
number of hardware queues through blk_mq_update_nr_hw_queues(), but it
does not resize the existing tag set or update its queue_depth.
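
The stale-state effect described above can be reproduced with a small
userspace model. This is only a sketch: struct model_dev, struct
model_queue and the model_*() helpers are hypothetical names that mimic
the early return in queue allocation, not the driver's real structures.

```c
#include <stdbool.h>
#include <stdint.h>

struct model_queue {
	bool allocated;
	uint32_t q_depth;	/* depth captured at first allocation */
};

struct model_dev {
	uint32_t q_depth;	/* recalculated from CAP on every reset */
	struct model_queue queue;
};

/* Mimics the early return: an existing queue is reused untouched. */
static void model_alloc_queue(struct model_dev *dev)
{
	if (dev->queue.allocated)
		return;

	dev->queue.q_depth = dev->q_depth;
	dev->queue.allocated = true;
}

/* Reset recalculates the device depth, then goes through allocation again. */
static void model_reset(struct model_dev *dev, uint32_t new_depth)
{
	dev->q_depth = new_depth;
	model_alloc_queue(dev);
}

/* Queue creation encodes the per-queue depth as a zero-based size. */
static uint32_t model_create_qsize(const struct model_dev *dev)
{
	return dev->queue.q_depth - 1;
}
```

In this model, a first reset with depth 1024 followed by a second reset
with depth 128 leaves dev->q_depth at 128 while the queue still carries
1024, so the creation request still asks for a zero-based size of 1023.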

This explains the observed shrink failure and the expected grow case:

* If the CAP-derived depth becomes smaller, the driver may try to create
  an I/O queue with the old larger nvmeq->q_depth. A controller that
  now enforces the smaller CAP.MQES-derived limit can reject the Create
  CQ/SQ command. If no I/O queues are recovered, nvme-pci removes the
  namespaces, so the block device disappears.

* If the CAP-derived depth becomes larger, the old nvmeq->q_depth is
  still within the new controller limit. Queue creation can therefore
  succeed and the device can remain usable, but the live state is
  inconsistent: dev->q_depth and ctrl->sqsize reflect the new capability
  while nvmeq queue resources and the blk-mq tag set still reflect the
  old depth. The larger depth is not used until the controller is
  removed and probed again.

There are two broad ways to address this.

The direct fix would be to make reset recovery handle a changed live queue
depth. That would require updating or rebuilding the nvmeq depth and
SQ/CQ DMA allocations, and resizing the block-layer depth state
consistently, including the blk-mq tag set, scheduler tags when present,
and queue->nr_requests. That is broader than an nvme-pci-only change
and needs block layer review.

This RFC instead takes the smaller approach of detecting the reset-time
CAP.MQES change and making it visible. If the CAP-derived depth is
smaller than the live I/O queue depth, the driver fails reset recovery
before recreating the I/O queues. If it is larger, the driver warns and
continues with the existing queue resources.
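
The detection policy can be sketched in userspace as follows. This is an
illustration of the shrink/grow decision only: the function name, the
message text and the -ENODEV return value are assumptions made for the
example and are not taken from the patch.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Compare the depth the live queues were built with against the depth
 * derived from CAP after reset.
 *
 * Returns 0 when reset recovery may continue, or a negative errno when
 * it should be failed (hypothetical choice: -ENODEV).
 */
static int model_check_depth_change(unsigned int live_depth,
				    unsigned int new_depth)
{
	if (new_depth < live_depth) {
		/* Shrink: existing queues exceed the new controller limit. */
		fprintf(stderr,
			"queue depth shrank from %u to %u, failing reset\n",
			live_depth, new_depth);
		return -ENODEV;
	}

	if (new_depth > live_depth) {
		/* Grow: old queues still fit, so warn and keep them. */
		fprintf(stderr,
			"queue depth grew from %u to %u, keeping existing queues\n",
			live_depth, new_depth);
	}

	return 0;
}
```

With the depths reported above, a change from 1024 to 128 takes the
failing branch, while a change from 128 to 1024 only produces the
warning and lets recovery continue.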

Feedback would be appreciated on whether this detection is useful on its
own, or whether nvme-pci should instead support full live queue-depth
resizing together with the required blk-mq changes.

Guzebing (1):
  nvme-pci: detect I/O queue depth changes after reset

 drivers/nvme/host/pci.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

--
2.20.1