Re: [PATCH] iommu/riscv: Replace illegal command with dummy IOFENCE to prevent hardware lockup
From: fangyu . yu
Date: Wed Jun 24 2026 - 10:11:19 EST
>When the RISC-V IOMMU encounters an illegal command, the hardware
>stops processing and the HEAD register remains pointing at the
>illegal command. If software does not handle this properly, the
>hardware will be stuck at this index indefinitely, preventing any
>further command queue operations.
>
>This patch implements a recovery mechanism by replacing the illegal
>command with a dummy IOFENCE instruction (all operands are zero):
>
>1. Prevents hardware lockup: By overwriting the illegal command with
> a valid instruction, the hardware can continue processing from the
> current position instead of being stuck.
>
>2. Enables user recovery: After replacing the illegal command, the
> user/driver has an opportunity to retry the original failed
> operation rather than losing all queued work.
>
>3. Minimal hardware impact: A dummy IOFENCE behaves as a NOP;
> it performs no cache invalidation operations and has no side
> effects on the system state. This is the safest replacement
> instruction.
>
>Signed-off-by: Zong Li <zong.li@xxxxxxxxxx>
>---
> drivers/iommu/riscv/iommu.c | 24 +++++++++++++++++++++++-
> 1 file changed, 23 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
>index cec3ddd7ab10..6305ec5f467b 100644
>--- a/drivers/iommu/riscv/iommu.c
>+++ b/drivers/iommu/riscv/iommu.c
>@@ -464,13 +464,35 @@ static unsigned int riscv_iommu_queue_send(struct riscv_iommu_queue *queue,
> static irqreturn_t riscv_iommu_cmdq_process(int irq, void *data)
> {
> const struct riscv_iommu_queue *queue = (struct riscv_iommu_queue *)data;
>- unsigned int ctrl;
>+ struct riscv_iommu_command cmd;
>+ unsigned int ctrl, head;
>
> /* Clear MF/CQ errors, complete error recovery to be implemented. */
> ctrl = riscv_iommu_readl(queue->iommu, queue->qcr);
> if (ctrl & (RISCV_IOMMU_CQCSR_CQMF | RISCV_IOMMU_CQCSR_CMD_TO |
> RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_FENCE_W_IP)) {
>+ /*
>+ * The head pointer is not updated by the hardware, it
>+ * still points to the index of illegal command
>+ */
>+ riscv_iommu_readl_timeout(queue->iommu, Q_HEAD(queue), head,
>+ !(head & ~queue->mask), 0,
>+ RISCV_IOMMU_QUEUE_TIMEOUT);
>+
>+ if (ctrl & RISCV_IOMMU_CQCSR_CMD_ILL) {
>+ /*
>+ * Use a dummy IOFENCE instead of the illegal command
>+ * to prevent hardware lockup
>+ */

The RISC-V IOMMU spec 1.0 (Section 5.15, cqcsr) states:

  "If software makes the CQ operational again after a cmd_ill or
  cqmf error, then software should resubmit the commands submitted
  since the last IOFENCE.C that successfully completed."

So it seems that simply replacing the illegal command and letting the
queue continue is not sufficient: the commands submitted since the last
successfully completed IOFENCE.C would also need to be resubmitted.
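
For illustration only, the bookkeeping that resubmission would need could
look roughly like the sketch below. It is a standalone model, not driver
code: struct cmdq_model, cmdq_collect_resubmit(), CMDQ_ENTRIES and the
free-running tail/fence_done indices are all hypothetical, and it assumes
software records where the last successfully completed IOFENCE.C was.

```c
#include <stddef.h>
#include <stdint.h>

#define CMDQ_ENTRIES 8u                  /* hypothetical ring size, power of two */
#define CMDQ_MASK    (CMDQ_ENTRIES - 1u)

/* Hypothetical software-side model of the queue, not the driver's struct. */
struct cmdq_model {
	uint64_t ring[CMDQ_ENTRIES];     /* one opaque word per command, for brevity */
	unsigned int tail;               /* free-running index of the next slot to write */
	unsigned int fence_done;         /* free-running index just past the last
					  * IOFENCE.C known to have completed */
};

/*
 * Copy every command queued after the last completed IOFENCE.C into out[],
 * skipping the entry at ring index ill_idx (the value read back from HEAD).
 * Returns how many commands would need to be resubmitted.
 */
static size_t cmdq_collect_resubmit(const struct cmdq_model *q,
				    unsigned int ill_idx,
				    uint64_t *out, size_t out_len)
{
	size_t n = 0;
	unsigned int i;

	for (i = q->fence_done; i != q->tail && n < out_len; i++) {
		if ((i & CMDQ_MASK) == ill_idx)
			continue;        /* do not replay the rejected command */
		out[n++] = q->ring[i & CMDQ_MASK];
	}
	return n;
}
```

With something like this, the error path could replay the collected
commands after the queue is made operational again, rather than relying
on the dummy IOFENCE alone.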

Thanks,
Fangyu

>+ memset(&cmd, 0, sizeof(cmd));
>+ cmd.dword0 = FIELD_PREP(RISCV_IOMMU_CMD0_OPCODE,
>+ RISCV_IOMMU_CMD_IOFENCE_OPCODE);
>+ memcpy(queue->base + head * sizeof(cmd), &cmd, sizeof(cmd));
>+ dma_wmb();
>+ }
>+
> riscv_iommu_writel(queue->iommu, queue->qcr, ctrl);
>+
> dev_warn(queue->iommu->dev,
> "Queue #%u error; fault:%d timeout:%d illegal:%d fence_w_ip:%d\n",
> queue->qid,
>--
>2.43.7