RE: [PATCH] net/mlx5: poll mlx5 eq during irq migration

From: Praveen Kannoju

Date: Fri Mar 06 2026 - 09:21:54 EST



Confidential - Oracle Restricted \Including External Recipients



> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Friday, March 6, 2026 6:02 AM
> To: Praveen Kannoju <praveen.kannoju@oracle.com>
> Cc: saeedm@nvidia.com; leon@kernel.org; tariqt@nvidia.com;
> mbloch@nvidia.com; andrew+netdev@lunn.ch; davem@davemloft.net;
> edumazet@google.com; kuba@kernel.org; pabeni@redhat.com;
> netdev@vger.kernel.org; linux-rdma@vger.kernel.org;
> linux-kernel@vger.kernel.org; Rama Nichanamatlu
> <rama.nichanamatlu@oracle.com>; Manjunath Patil
> <manjunath.b.patil@oracle.com>; Anand Khoje <anand.a.khoje@oracle.com>
> Subject: Re: [PATCH] net/mlx5: poll mlx5 eq during irq migration
>
> On Thu, Mar 05, 2026 at 05:08:52PM +0000, Praveen Kannoju wrote:
>
> > Regardless of the underlying causes, which may include IRQ loss
> > or EQ re-arming failure, the TX queue becomes stuck, and the
> > timeout handler is only triggered once the queue is declared
> > full. In scenarios where only specialized packets, such as
> > heartbeat packets, are sent through the queue, it takes
> > significantly longer for the queue to fill and be identified as
> > stuck. A proven solution for this issue is polling the EQ
> > immediately after the corresponding IRQ migration, which allows
> > for earlier recovery and prevents the transmission queue from
> > becoming stuck.
>
> I understand all of this, but for upstreaming we want the root cause, not
> bodges like this.
>
> There is no reason to do what this patch does; the IRQ system is not supposed
> to lose interrupts on migration. If that is happening on your systems, it is a
> serious bug that must be root-caused.

Thank you, Jason.
We'll investigate this further.

>
> Jason