RE: [PATCH] net/mlx5: poll mlx5 eq during irq migration

From: Praveen Kannoju

Date: Sat Mar 07 2026 - 00:45:53 EST



Confidential - Oracle Restricted \ Including External Recipients


> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Saturday, March 7, 2026 4:40 AM
> To: Praveen Kannoju <praveen.kannoju@oracle.com>
> Cc: saeedm@nvidia.com; leon@kernel.org; tariqt@nvidia.com;
> mbloch@nvidia.com; andrew+netdev@lunn.ch; davem@davemloft.net;
> edumazet@google.com; kuba@kernel.org; pabeni@redhat.com;
> netdev@vger.kernel.org; linux-rdma@vger.kernel.org; linux-
> kernel@vger.kernel.org; Rama Nichanamatlu
> <rama.nichanamatlu@oracle.com>; Manjunath Patil
> <manjunath.b.patil@oracle.com>; Anand Khoje <anand.a.khoje@oracle.com>
> Subject: Re: [PATCH] net/mlx5: poll mlx5 eq during irq migration
>
> On Fri, Mar 06, 2026 at 02:19:09PM +0000, Praveen Kannoju wrote:
> >
> > > On Thu, Mar 05, 2026 at 05:08:52PM +0000, Praveen Kannoju wrote:
> > >
> > > > Regardless of the underlying causes, which may include IRQ loss
> > > > or EQ re-arming failure, the TX queue becomes stuck, and the
> > > > timeout handler is only triggered once the queue is declared
> > > > full. In scenarios where only specialized packets, such as
> > > > heartbeat packets, are sent through the queue, it takes
> > > > significantly longer for the queue to fill and be identified as
> > > > stuck. A proven solution for this issue is polling the EQ
> > > > immediately after the corresponding IRQ migration, which allows
> > > > for earlier recovery and prevents the transmission queue from
> > > > becoming stuck.
> > >
> > > I understand all of this, but for upstreaming we want the root cause,
> > > not bodges like this.
> > >
> > > There is no reason to do what this patch does, the IRQ system is not
> > > supposed to lose interrupts on migration, if that is happening on
> > > your systems it is a serious bug that must be root caused.
> >
> > Thank you, Jason.
> > We'll evaluate more on it.
>
> If this is in a VM running under qemu - qemu does Lots Of Stuff whenever a
> MSI-X is changed and that stuff has been buggy before and resulted in lost
> things.
>
> If it is bare metal, I'm shocked. Maybe an IOMMU driver bug in the interrupt
> remapping?
>
> Jason


Hello Jason,

Yes, this is a QEMU VM on which we are seeing the issue.
In our bare-metal environments there is no CPU scaling,
and no issue has been seen on bare metal so far; it may
also be unlikely there.

We are hitting this issue because our QEMU VMs go through
CPU scaling based on business needs.

It has been very challenging to arrive at the cause. We
went through many live debug sessions with the Nvidia R&D
team, but we could not root-cause it. That is why we
eventually arrived at this mitigation: the issue is
widespread and has been hurting many customers in the cloud.

-
Praveen