Re: [PATCH] PCI: vmd: Delay interrupt handling on MTL VMD controller
From: Manivannan Sadhasivam
Date: Fri Sep 13 2024 - 07:12:00 EST
On Fri, Sep 13, 2024 at 01:55:49PM +0800, Kai-Heng Feng wrote:
> Hi Nirmal,
>
> On Fri, Sep 13, 2024 at 1:45 AM Nirmal Patel
> <nirmal.patel@xxxxxxxxxxxxxxx> wrote:
> >
> > On Fri, 6 Sep 2024 09:56:59 +0800
> > Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx> wrote:
> >
> > > On Wed, Sep 4, 2024 at 2:22 PM Manivannan Sadhasivam
> > > <manivannan.sadhasivam@xxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Sep 04, 2024 at 09:57:08AM +0800, Kai-Heng Feng wrote:
> > > > > On Tue, Sep 3, 2024 at 10:51 PM Keith Busch <kbusch@xxxxxxxxxx>
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Sep 03, 2024 at 03:07:45PM +0800, Kai-Heng Feng wrote:
> > > > > > > On Tue, Sep 3, 2024 at 12:29 PM Manivannan Sadhasivam
> > > > > > > <manivannan.sadhasivam@xxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > On Tue, Sep 03, 2024 at 10:55:44AM +0800, Kai-Heng Feng
> > > > > > > > wrote:
> > > > > > > > > Meteor Lake VMD has a bug that the IRQ raises before the
> > > > > > > > > DMA region is ready, so the requested IO is considered
> > > > > > > > > never completed: [ 97.343423] nvme nvme0: I/O 259 QID 2
> > > > > > > > > timeout, completion polled [ 97.343446] nvme nvme0: I/O
> > > > > > > > > 384 QID 3 timeout, completion polled [ 97.343459] nvme
> > > > > > > > > nvme0: I/O 320 QID 4 timeout, completion polled [
> > > > > > > > > 97.343470] nvme nvme0: I/O 707 QID 5 timeout, completion
> > > > > > > > > polled
> > > > > > > > >
> > > > > > > > > The is documented as erratum MTL016 [0]. The suggested
> > > > > > > > > workaround is to "The VMD MSI interrupt-handler should
> > > > > > > > > initially perform a dummy register read to the MSI
> > > > > > > > > initiator device prior to any writes to ensure proper
> > > > > > > > > PCIe ordering." which essentially is adding a delay
> > > > > > > > > before the interrupt handling.
> > > > > > > >
> > > > > > > > Why can't you add a dummy register read instead? Adding a
> > > > > > > > delay for PCIe ordering is not going to work always.
> > > > > > >
> > > > > > > This can be done too. But it can take longer than 4us delay,
> > > > > > > so I'd like to keep it a bit faster here.
> > > > > >
> > > > > > An added delay is just a side effect of the read. The read
> > > > > > flushes pending device-to-host writes, which is most likely
> > > > > > what the errata really requires. I think Mani is right, you
> > > > > > need to pay that register read penalty to truly fix this.
> > > > >
> > > > > OK, will change the quirk to perform dummy register read.
> > > > >
> > > > > But I am not sure which is the "MSI initiator device", is it VMD
> > > > > controller or NVMe devices?
> > > > >
> > > >
> > > > 'MSI initiator' should be the NVMe device. My understanding is that
> > > > the workaround suggests reading the NVMe register from the MSI
> > > > handler before doing any write to the device to ensures that the
> > > > previous writes from the device are flushed.
> > >
> > > Hmm, it would be really great to contain the quirk in VMD controller.
> > > Is there anyway to do that right before generic_handle_irq()?
> > >
> > The bug is in hardware, I agree with Kai-Heng to contain it to VMD
> > controller.
>
I'd love to, but if I read the workaround correctly, it suggests reading the
register of the MSI initiator device, which is NVMe. IDK, how you can read the
NVMe register from the VMD driver.
> The problem I am facing right now is that I can't connect the virq to
> NVMe's struct device to implement the quirk.
>
> Do you have any idea how to achieve that?
>
> Kai-Heng
>
> >
> > > >
> > > > And this sounds like the workaround should be done in the NVMe
> > > > driver as it has the knowledge of the NVMe registers. But isn't the
> > > > NVMe driver already reading CQE status first up in the ISR?
> > >
> > > The VMD interrupt is fired before the CQE status update, hence the
> > > bug.
I'm not able to understand the bug properly. The erratum indicates that the MSI
from device reaches the VMD before other writes to the registers. So this is an
ordering issue as MSI takes precedence over other writes from the device.
So the workaround is to read the device register in the MSI handler to make sure
the previous writes from the device are flushed. IIUC, once the MSI reaches the
VMD, it will trigger the IRQ handler in the NVMe driver and in the handler, CQE
status register is read first up. This flow matches with the workaround
suggested.
Is any write being performed to the NVMe device before reading any register in
the MSI handler? Or the current CQE read is not able to satisfy the workaround?
Please clarify.
- Mani
--
மணிவண்ணன் சதாசிவம்