Re: Kernel 5.15 doesn't detect SATA drive on boot

From: Krzysztof Wilczyński
Date: Wed Nov 17 2021 - 04:36:20 EST


Hi Marc,

[...]
> > > I think that this problem is due to recent PCI subsystem changes which broke Mac
> > > support. The problem shows up as the interrupts not being delivered, which in
> > > turn results in the kernel assuming that the drive is not working (see the
> > > timeout error messages in your dmesg output). Hence your boot drive detection
> > > fails and there is no rootfs to mount.
> > >
> > > Adding linux-pci list.
> > >
> > >
> > >
> > > >
> > > > Regards.
> > > >
> > > > [1] https://archlinux.org/packages/core/x86_64/linux/
> > > > [2] https://bugs.archlinux.org/task/72734
> >
> > The error in the dmesg output (see [2] where the log file is attached)
> > looks similar to the problem reported a week or so ago, as per:
> >
> > https://lore.kernel.org/linux-pci/ee3884db-da17-39e3-4010-bcc8f878e2f6@xxxxxxxxxxx/
> >
> > The problematic commits were reverted by Bjorn and the Pull Request that
> > did it was accepted, as per:
> >
> > https://lore.kernel.org/linux-pci/20211111195040.GA1345641@bhelgaas/
> >
> > Thus, this would have made its way into 5.16-rc1, I suppose. We might have to
> > back-port this to the stable and long-term kernels.
> >
> > Yuji, could you, if you have some time to spare, try 5.16-rc1 to see if
> > this has gotten better on your system?
>
> I'm afraid you have the wrong end of the stick on this one.
>
> The issue is reported on 5.15, and the issue you are pointing at was
> introduced during the 5.16 merge window. The problematic commit wasn't
> reverted, but instead fixed in 10a20b34d735 ("of/irq: Don't ignore
> interrupt-controller when interrupt-map failed").

Ahh. My bad! I missed the conclusion of the conversation involving the
Nemo board and the patch you proposed here:

https://lore.kernel.org/linux-pci/87mtma8udh.wl-maz@xxxxxxxxxx/

I then assumed that what Bjorn reverted in his Pull Request was the
solution to the reported problems. Apologies for conflating the issues
here, and also thank you for all the details.

Do we still need to back-port some of the fixes to the stable and LTS
kernels, then? I just want to make sure that things will make it there, if
needed.

> The issue is instead very close to the one reported at [1], for which
> we have a very conservative workaround in 5.16-rc1 (commits
> 2226667a145d and f21082fb20db). Looking at the dmesg log provided by
> Yuji, you find the following nugget:
>
> [ 0.378564] pci 0000:00:0a.0: [10de:0d88] type 00 class 0x010601
>
> Oh look, an NVIDIA AHCI controller, probably similar enough to the one
> discussed in the issue reported by Rui.

Good to know for future reference that these can be problematic.

> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 003950c738d2..cd88eddf614d 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5857,3 +5857,4 @@ static void nvidia_ion_ahci_fixup(struct pci_dev *pdev)
>  	pdev->dev_flags |= PCI_DEV_FLAGS_HAS_MSI_MASKING;
>  }
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, 0x0ab8, nvidia_ion_ahci_fixup);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, 0x0d88, nvidia_ion_ahci_fixup);

Thank you! I hope this will fix Yuji's issues.

Krzysztof