Re: [PATCH] net/mlx5: poll mlx5 eq during irq migration
From: Jason Gunthorpe
Date: Fri Mar 06 2026 - 18:12:16 EST
On Fri, Mar 06, 2026 at 02:19:09PM +0000, Praveen Kannoju wrote:
>
> > On Thu, Mar 05, 2026 at 05:08:52PM +0000, Praveen Kannoju wrote:
> >
> > > Regardless of the underlying causes, which may include IRQ loss
> > > or EQ re-arming failure, the TX queue becomes stuck, and the
> > > timeout handler is only triggered once the queue is declared
> > > full. In scenarios where only specialized packets, such as
> > > heartbeat packets, are sent through the queue, it takes
> > > significantly longer for the queue to fill and be identified as
> > > stuck. A proven solution for this issue is polling the EQ
> > > immediately after the corresponding IRQ migration, which allows
> > > for earlier recovery and prevents the transmission queue from
> > > becoming stuck.
> >
> > I understand all of this, but for upstreaming we want the root cause, not
> > bodges like this.
> >
> > There is no reason to do what this patch does. The IRQ system is not supposed
> > to lose interrupts on migration; if that is happening on your systems it is a
> > serious bug that must be root caused.
>
> Thank you, Jason.
> We'll evaluate more on it.
If this is in a VM running under qemu - qemu does Lots Of Stuff
whenever an MSI-X is changed and that stuff has been buggy before and
resulted in lost things.

If it is bare metal, I'm shocked. Maybe an IOMMU driver bug in the
interrupt remapping?

Jason