Re: [PATCH net] net/mlx5e: Skip NAPI polling when PCI channel is offline

From: Tariq Toukan

Date: Wed Feb 11 2026 - 06:27:19 EST




On 09/02/2026 20:01, Breno Leitao wrote:
When a PCI error (e.g. AER error or DPC containment) marks the PCI
channel as frozen or permanently failed, the IOMMU mappings for the
device may already be torn down. If mlx5e_napi_poll() continues
processing CQEs in this state, every call to dma_unmap_page() triggers
a WARN_ON in iommu_dma_unmap_phys().

In a real-world crash scenario on an NVIDIA Grace (ARM64) platform,
a DPC event froze the PCI channel and the mlx5 NAPI poll continued
processing error CQEs, calling dma_unmap for each pending WQE. Here is
an example:

The DPC event on port 0007:00:00.0 fires and eth1 (on 0017:01:00.0) starts
seeing error CQEs almost immediately:

pcieport 0007:00:00.0: DPC: containment event, status:0x2009
mlx5_core 0017:01:00.0 eth1: Error cqe on cqn 0x54e, ci 0xb06, ...

The WARN_ON storm begins ~0.4s later and repeats for every pending WQE:

WARNING: CPU: 32 PID: 0 at drivers/iommu/dma-iommu.c:1237 iommu_dma_unmap_phys
Call trace:
iommu_dma_unmap_phys+0xd4/0xe0
mlx5e_tx_wi_dma_unmap+0xb4/0xf0
mlx5e_poll_tx_cq+0x14c/0x438
mlx5e_napi_poll+0x6c/0x5e0
net_rx_action+0x160/0x5c0
handle_softirqs+0xe8/0x320
run_ksoftirqd+0x30/0x58

After ~23 seconds of the WARN_ON() storm, the watchdog fires:

watchdog: BUG: soft lockup - CPU#32 stuck for 23s! [ksoftirqd/32:179]
Kernel panic - not syncing: softlockup: hung tasks

Each unmap hit the WARN_ON in the IOMMU layer, printing a full stack
trace. With dozens of pending WQEs, this created a storm of WARN_ON
dumps in softirq context that monopolized the CPU for over 23 seconds,
triggering a soft lockup panic.

Fix this by checking pci_channel_offline() at the top of
mlx5e_napi_poll() and bailing out immediately when the channel is
offline. napi_complete_done() is called before returning to clear the
NAPI_STATE_SCHED bit, ensuring that napi_disable() in the teardown path
does not spin forever waiting for it. No CQ interrupts are re-armed
since the explicit mlx5e_cq_arm() calls are skipped, so the NAPI
instance will not be re-scheduled. The pending DMA buffers are left for
device removal to clean up.

Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files")
Signed-off-by: Breno Leitao <leitao@xxxxxxxxxx>
---
drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 76108299ea57d..934ad7fafa801 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -138,6 +138,19 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 	bool xsk_open;
 	int i;
+	/*
+	 * When the PCI channel is offline, IOMMU mappings may already be torn
+	 * down. Processing CQEs would call dma_unmap for every pending WQE,
+	 * each hitting a WARN_ON in the IOMMU layer. The resulting storm of
+	 * warnings in softirq context can monopolise the CPU long enough to
+	 * trigger a soft lockup and prevent any RCU grace period from
+	 * completing.
+	 */
+	if (unlikely(pci_channel_offline(c->mdev->pdev))) {
+		napi_complete_done(napi, 0);
+		return 0;
+	}
+
 	rcu_read_lock();
 	qos_sqs = rcu_dereference(c->qos_sqs);

---
base-commit: a956792a1543c2bf4a2266cb818dc7c4135006f0
change-id: 20260209-mlx5_iommu-c8b238b1bb14

Best regards,
--
Breno Leitao <leitao@xxxxxxxxxx>



Hi,

Thanks for your patch.

You're raising an interesting problem, but I am not convinced by this approach.

Why would the driver perform this check if it doesn't guarantee that the invalid access is prevented? It only limits the exposure to "one napi cycle", which happens to be good enough to prevent the soft lockup in your case.

What if a napi cycle is configured with a larger budget?

If the problem is that the WARN_ON is being called at a high rate, then it should be rate-limited.
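For illustration, the kernel already ships rate-limiting primitives for exactly this (e.g. __ratelimit() / WARN_RATELIMIT() from include/linux/ratelimit.h). The suppression logic they implement can be sketched in plain userspace C; the struct and function below are simplified stand-ins for the kernel API, not the real implementation:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of the kernel's ratelimit_state: allow at most
 * `burst` events per `interval_ms` window and suppress the rest,
 * so a storm of identical warnings cannot monopolize the console.
 */
struct ratelimit_state {
	long interval_ms;	/* length of one rate-limit window */
	int burst;		/* events allowed per window */
	int printed;		/* events already allowed in this window */
	long begin_ms;		/* start of current window, 0 = unset */
};

/* Return true if the caller may emit its warning now. */
static bool ratelimit_allow(struct ratelimit_state *rs, long now_ms)
{
	if (rs->begin_ms == 0 || now_ms - rs->begin_ms >= rs->interval_ms) {
		/* Window expired (or first call): open a fresh one. */
		rs->begin_ms = now_ms;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	return false;	/* suppressed: the storm stays off the console */
}
```

In kernel code the equivalent pattern would be `if (__ratelimit(&rs)) WARN(...)`, or the WARN_RATELIMIT() wrapper directly, instead of this hand-rolled helper.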