Re: [PATCH] nvme-pci: fix potential I/O hang when CQ is full

From: Junnan Zhang

Date: Wed Feb 11 2026 - 04:53:16 EST

On Tue, 10 Feb 2026 16:57:12 +0100, Christoph Hellwig wrote:

> We can't update the CQ head before consuming the CQEs, otherwise
> the device can reuse them. And devices must not discard completions
> when there is no completion queue entry, nvme does allow SQs and CQs
> to be smaller than the number of outstanding commands.

Updating the CQ head before consuming the CQE would not cause the device to
reuse these entries, as new commands can only be submitted by the driver after
the CQE is consumed. Therefore, the device does not have the opportunity
to reuse these entries.

Actually, the root cause of the issue is that the underlying device received
more commands from the NVMe driver than the queue depth (q_depth), leading
to a CQ full problem.

In my environment, the NVMe admin queue depth is 32, allowing a maximum of
32 commands to be processed concurrently. During the NVMe disk removal process,
the NVMe driver sends commands via the admin queue to delete all I/O queues.
Once 32 commands are in flight, any additional command has to wait for a tag
until one of the earlier commands completes.

During NVMe interrupt handling, the current implementation first processes the
CQE and then updates the CQ head. The commands allocated by nvme_delete_queue
are not completed through the batch path in the interrupt handler. After
consuming the CQE, the tag is released and the upper-layer NVMe driver is notified
(note: at this point the CQ head has not yet been updated, so the device still
sees the CQ slot as occupied). Upon receiving the notification, the NVMe
driver immediately submits a new command to the SQ. When the underlying device
finishes that command and tries to write the result back to the CQ while the CQ
head is still not updated, the number of commands processed by the underlying
device exceeds the NVMe queue depth. Since there is no free slot in the CQ for
the completion, a CQ full error is reported.

The above process can be illustrated by the following diagram:

driver               irq                    underlying(virtual/hardware)
------               ------                 ------
1. Wait for tag
                     1. Read CQE            CQ is full, wait for head update
                     2. Handle CQE
                     3. Wake up tag
                        (blk_mq_put_tag)
2. Get tag
3. Issue new cmd
                                            1. Process cmd
                                            2. Try write to CQ
                                            3. CQ is full, discard cmd!
                     4. Update CQ head
                        (LATE!)
4. Cmd timeout
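
The same ordering can be sketched with a small user-space model. This is only
an illustration, not kernel code: struct toy_cq, device_post_cqe() and
run_irq() are hypothetical names, the depth of 4 is arbitrary, and the model
assumes a device that fails the write when it sees no free CQ slot, as the one
in my environment does.

```c
#include <stdbool.h>

#define TOY_Q_DEPTH 4	/* arbitrary small depth, for illustration only */

/* Toy completion queue as seen by the device (hypothetical model). */
struct toy_cq {
	unsigned int head;	/* head last written to the CQ head doorbell */
	unsigned int tail;	/* next slot the device will write */
};

/* The device treats the CQ as full when tail + 1 would reach the head. */
static bool toy_cq_full(const struct toy_cq *cq)
{
	return (cq->tail + 1) % TOY_Q_DEPTH == cq->head;
}

/* Device posts one completion; returns false if there is no free slot. */
static bool device_post_cqe(struct toy_cq *cq)
{
	if (toy_cq_full(cq))
		return false;
	cq->tail = (cq->tail + 1) % TOY_Q_DEPTH;
	return true;
}

/*
 * Model one interrupt. Returns true if the completion of the command
 * issued right after the tag was released found a free CQ slot.
 */
static bool run_irq(bool doorbell_before_complete)
{
	struct toy_cq cq = { .head = 0, .tail = 0 };
	unsigned int driver_head = 0;
	bool posted;

	/* Device fills the CQ until it considers it full. */
	while (device_post_cqe(&cq))
		;

	/* irq steps 1-2: read and handle one CQE (local head only). */
	driver_head = (driver_head + 1) % TOY_Q_DEPTH;

	if (doorbell_before_complete)
		cq.head = driver_head;	/* device learns about the free slot */

	/*
	 * irq step 3 and driver steps 2-3: the tag is released, a new
	 * command is issued, and the device tries to post its completion.
	 */
	posted = device_post_cqe(&cq);

	if (!doorbell_before_complete)
		cq.head = driver_head;	/* irq step 4: head update, too late */

	return posted;
}
```

In this model run_irq(false) returns false, which corresponds to the current
order (complete first, update the head afterwards), and run_irq(true) returns
true, which corresponds to updating the head before the tag is released.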

Best regards,
Junnan Zhang