Re: [PATCH] pci: Add a few new IDs for Intel GPU "spurious interrupt" quirk

From: Bin Meng
Date: Mon Oct 08 2018 - 05:47:34 EST


Hi Bjorn,

On Thu, Oct 4, 2018 at 4:12 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Thu, Sep 27, 2018 at 10:10:07AM +0800, Bin Meng wrote:
> > On Thu, Sep 27, 2018 at 12:57 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > On Wed, Sep 26, 2018 at 08:14:01AM -0700, Bin Meng wrote:
> > > > Add more PCI IDs to the Intel GPU "spurious interrupt" quirk table,
> > > > which are known to break.
> > >
> > > Do you have a reference for this? Any public bug reports, bugzilla,
> > > Intel spec reference or errata? "Which are known to break" is pretty
> > > vague.
> >
> > Sorry I used wrong words and should have been clearer. These devices
> > are validated to be broken. The test I used is very simple, just
> > unplug the VGA cable and plug it again, and "spurious interrupt" will
> > be seen on the interrupt line of the IGD device. I was not aware of
> > any public bugs filed to Intel, nor seen any errata from Intel.
>
> The original commit, f67fd55fa96f ("PCI: Add quirk for still enabled
> interrupts on Intel Sandy Bridge GPUs"), says some systems "crash"
> (not sure if that means an oops or an actual crash that requires a
> reboot) and on other systems, Linux disables the shared interrupt
> line. I assume disabling the interrupt line keeps devices using that
> line from working, but does not directly cause a crash.
>

Correct, disabling the shared interrupt line keeps all devices using
that line from working, and that is the current kernel's behavior
without this quirk: it disables the (shared) interrupt line after
100,000+ generated interrupts. The side effect is that other devices
become unusable after that (e.g. USB devices which share the same
interrupt line with the Intel GPU). That's why the original commit,
f67fd55fa96f ("PCI: Add quirk for still enabled interrupts on Intel
Sandy Bridge GPUs"), disables the GPU's interrupt directly, which
should really have been done by the VGA BIOS itself (a buggy VBIOS!).
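
For reference, here is a rough user-space model of the generic
spurious-interrupt accounting described above. It is only a sketch of
my understanding of kernel/irq/spurious.c, not kernel code: struct
irq_model and model_note_interrupt() are made-up names, and the
100000/99900 thresholds are what I believe the kernel uses.

```c
#include <stdbool.h>

/* Hypothetical, simplified per-IRQ state; not the kernel's struct irq_desc. */
struct irq_model {
	unsigned int irq_count;      /* interrupts seen in the current window */
	unsigned int irqs_unhandled; /* of those, how many no handler claimed */
	bool disabled;               /* the line has been shut off */
};

/* Account for one interrupt; 'handled' is false when no handler claimed it. */
static void model_note_interrupt(struct irq_model *m, bool handled)
{
	if (m->disabled)
		return;

	if (!handled)
		m->irqs_unhandled++;

	/* Only evaluate once a full window of 100000 interrupts has been seen. */
	if (++m->irq_count < 100000)
		return;

	/* Almost everything in the window was unhandled: treat the line as stuck. */
	if (m->irqs_unhandled > 99900)
		m->disabled = true;

	/* Start a new window. */
	m->irq_count = 0;
	m->irqs_unhandled = 0;
}
```

With no driver bound to the IGD, every interrupt it raises counts as
unhandled, so in this model the shared line is disabled once the first
window fills up, and every other device on that line goes with it.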

> What specific symptom do you see here? I think it might be useful to
> collect details, e.g., dmesg logs, /proc/interrupts contents, output
> of "sudo lspci -vv", etc., for the systems you're quirking here. I'm
> hoping we can eventually figure out a solution that doesn't require a
> quirk for every new GPU, and maybe that info will help find it.
>

The symptom is the one described briefly in the original commit
f67fd55fa96f: the kernel disables the (shared) interrupt line after
100,000+ generated interrupts (this can be observed via
/proc/interrupts).
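
If it helps when collecting data, the count can be watched with a small
helper along these lines. This is only a sketch: irq_line_total() is a
made-up name, and it assumes the usual /proc/interrupts layout where
each line is "<irq>:" followed by one counter per CPU and then the
chip and handler names.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Sum the per-CPU counters on one /proc/interrupts line.
 * Returns 0 and stores the sum in *total if the line is for 'irq',
 * -1 otherwise (different IRQ, or a non-numeric row such as "NMI:").
 */
static int irq_line_total(const char *line, int irq, unsigned long long *total)
{
	char *end;
	long n = strtol(line, &end, 10);

	if (end == line || *end != ':' || n != irq)
		return -1;
	end++;

	*total = 0;
	for (;;) {
		const char *p = end;

		while (*p == ' ' || *p == '\t')
			p++;
		/* The first token that does not start with a digit ends the counters. */
		if (!isdigit((unsigned char)*p))
			break;
		*total += strtoull(p, &end, 10);
	}
	return 0;
}
```

Feeding it each line read from /proc/interrupts with fgets(), before
and after replugging the VGA cable, should show the counter for the
IGD's line jumping by roughly 100,000 before the kernel shuts the line
off.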

> > > > See commit f67fd55fa96f ("PCI: Add quirk for still enabled interrupts
> > > > on Intel Sandy Bridge GPUs"), and commit 7c82126a94e6 ("PCI: Add new
> > > > ID for Intel GPU "spurious interrupt" quirk") for some history.
> > > >
> > > > Based on current findings, it is highly possible that all Intel
> > > > 1st/2nd/3rd generation Core processors' IGD has such quirk.
> > >
> > > Can you include a reference to these "current findings"? I assume you
> > > have bug reports that include the device IDs you're adding? If not,
> > > how did you build this list of new IDs?
> >
> > By "current findings" I mean given the IDs we have here, plus previous
> > one added by Thomas, it's highly possible this VGA BIOS bug exists in
> > every 1st/2nd/3rd generation Core processors.
> >
> > > The function comment added by f67fd55fa96f ("PCI: Add quirk for still
> > > enabled interrupts on Intel Sandy Bridge GPUs") suggests that this is
> > > actually a BIOS issue, not a hardware erratum, i.e., I don't see
> > > anything there that suggests a hardware defect.
> > >
> > > But there must be a hole somewhere -- the kernel can't be expected to
> > > disable interrupts in device-specific ways when there's no driver
> > > loaded. Maybe it's simply a BIOS defect or maybe there's some
> > > interrupt or _PRT-related setup we're missing.
> >
> > It's a pure VGA BIOS bug, not the BIOS bug or _PRT etc. The VGA BIOS
> > forgot to turn off the interrupt on these devices.
>
> If this is a VGA BIOS defect, it's not very likely that it will
> magically be fixed for all new Intel GPUs, so in effect it sounds like
> we need to update this list of quirks in Linux every time a new Intel
> GPU comes out. That prospect is a little daunting.
>

I don't have a newer Intel board at hand for testing right now, but I
can try to locate one. As I said, it's highly likely that at least all
1st/2nd/3rd generation Core processors are affected. Maybe we can add
all the known GPU devices of the 1st/2nd/3rd generation Core processors
together for now? For newer GPUs, let's wait until someone reports the
issue again?

> Do you happen to know if Windows has the same problem? I.e., if you
> boot an old version of Windows with a new GPU, and unplug the VGA
> cable, does Windows crash? If Windows can figure out how to handle
> that situation gracefully, Linux should be able to do it, too.
>

I suspect Windows cannot handle it either. Without awareness of the
GPU, the interrupt line is simply left on and no driver claims the
device, which will cause issues. I can test this.

> > > > Signed-off-by: Bin Meng <bmeng.cn@xxxxxxxxx>
> > > > Cc: <stable@xxxxxxxxxxxxxxx> # v3.4+
> > > > ---
> > > >
> > > > drivers/pci/quirks.c | 4 ++++
> > > > 1 file changed, 4 insertions(+)
> > > >
> > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > > index 6bc27b7..c0673a7 100644
> > > > --- a/drivers/pci/quirks.c
> > > > +++ b/drivers/pci/quirks.c
> > > > @@ -3190,7 +3190,11 @@ static void disable_igfx_irq(struct pci_dev *dev)
> > > >
> > > > pci_iounmap(dev, regs);
> > > > }
> > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0042, disable_igfx_irq);
> > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0046, disable_igfx_irq);
> > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x004a, disable_igfx_irq);
> > > > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0102, disable_igfx_irq);
> > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0106, disable_igfx_irq);
> > > > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x010a, disable_igfx_irq);
> > > > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0152, disable_igfx_irq);
> > > >
> > > > --

Regards,
Bin