Re: Kernel 5.15 doesn't detect SATA drive on boot

From: Marc Zyngier
Date: Wed Nov 17 2021 - 04:07:59 EST


Hi Krzysztof, Yuji,

On Tue, 16 Nov 2021 23:26:18 +0000,
Krzysztof Wilczyński <kw@xxxxxxxxx> wrote:
>
> [+CC Arnd, Bjorn, Marc and Sasha for visibility]
>
> Hello Damien and Yuji,
>
> [...]
> > > I'm using Arch Linux on a MacBook Air 2010. I updated the `linux` package[1]
> > > from v5.14.16 to v5.15.2 the other day, and the boot process stalled
> > > with the following message.
> > >
> > > ```shell
> > > :: running early hook [udev]
> > > Starting version 249.6-3-arch
> > > :: running hook [udev]
> > > :: Triggering uevents...
> > > Waiting 10 seconds for device /dev/sda3 ...
> > > ERROR: device '/dev/sda3' not found. Skipping fsck.
> > > :: mounting '/dev/sda' on real root
> > > mount: /new_root: no filesystem type specified.
> > > You are now being dropped into an emergency shell.
> > > sh: can't access tty; job control turned off
> > > [rootfs ]#
> > > ```
> > >
> > > In the emergency shell there are no `sda` devices when I type
> > > `$ ls /dev/`. After downgrading the kernel, the boot process works properly.
> > >
> > > See also Arch Linux bug tracker[2]. There are similar reports on
> > > Apple devices.
> > >
> > > The `dmesg` output from the emergency shell is attached. I guess this issue
> > > is related to libata, so I have CCed Damien Le Moal.
> >
> > I think that this problem is due to recent PCI subsystem changes which broke Mac
> > support. The problem shows up as the interrupts not being delivered, which in
> > turn results in the kernel assuming that the drive is not working (see the
> > timeout error messages in your dmesg output). Hence your boot drive detection
> > fails and there is no rootfs to mount.
> >
> > Adding linux-pci list.
> >
> >
> >
> > >
> > > Regards.
> > >
> > > [1] https://archlinux.org/packages/core/x86_64/linux/
> > > [2] https://bugs.archlinux.org/task/72734
>
> The error in the dmesg output (see [2] where the log file is attached)
> looks similar to the problem reported a week or so ago, as per:
>
> https://lore.kernel.org/linux-pci/ee3884db-da17-39e3-4010-bcc8f878e2f6@xxxxxxxxxxx/
>
> The problematic commits were reverted by Bjorn and the Pull Request that
> did it was accepted, as per:
>
> https://lore.kernel.org/linux-pci/20211111195040.GA1345641@bhelgaas/
>
> Thus, this would have made its way into 5.16-rc1, I suppose. We might have to
> back-port this to the stable and long-term kernels.
>
> Yuji, could you, if you have some time to spare, try 5.16-rc1 to see if
> this has gotten better on your system?

I'm afraid you have the wrong end of the stick on this one.

The issue is reported on 5.15, and the issue you are pointing at was
introduced during the 5.16 merge window. The problematic commit wasn't
reverted, but instead fixed in 10a20b34d735 ("of/irq: Don't ignore
interrupt-controller when interrupt-map failed").

The issue is instead very close to the one reported at [1], for which
we have a very conservative workaround in 5.16-rc1 (commits
2226667a145d and f21082fb20db). Looking at the dmesg log provided by
Yuji, you find the following nugget:

[ 0.378564] pci 0000:00:0a.0: [10de:0d88] type 00 class 0x010601

Oh look, an NVIDIA AHCI controller, probably similar enough to the one
discussed in the issue reported by Rui.
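
For anyone who wants to check their own dmesg for the same kind of device,
here is a minimal userspace sketch of how that line decodes. It assumes the
line has the shape shown above (`[vendor:device] type NN class 0xCCCCCC`).
The names `struct pci_ident`, `parse_pci_line()`, `is_nvidia_ahci()` and the
`SKETCH_*` macros are made up for this illustration and are not kernel
interfaces. The constants follow the usual PCI encoding: vendor 0x10de is
NVIDIA, and class 0x010601 is mass storage (0x01), SATA (0x06), AHCI (0x01).

```c
#include <stdio.h>
#include <string.h>

/* Local to this sketch; the values mirror the standard PCI assignments. */
#define SKETCH_VENDOR_NVIDIA	0x10de
#define SKETCH_CLASS_SATA_AHCI	0x010601	/* base 01, subclass 06, prog-if 01 */

/* Hypothetical holder for the IDs printed in one "pci ..." dmesg line. */
struct pci_ident {
	unsigned int vendor;		/* 16-bit vendor ID */
	unsigned int device;		/* 16-bit device ID */
	unsigned int class_code;	/* 24-bit class code */
};

/*
 * Parse a line such as
 *   "[ 0.378564] pci 0000:00:0a.0: [10de:0d88] type 00 class 0x010601"
 * Returns 0 on success, -1 if the line does not have that shape.
 */
int parse_pci_line(const char *line, struct pci_ident *id)
{
	/* The ID bracket follows the "domain:bus:dev.fn: " prefix. */
	const char *p = strstr(line, ": [");

	if (!p)
		return -1;

	/* The header type is skipped; %x accepts the leading "0x". */
	if (sscanf(p, ": [%4x:%4x] type %*x class %x",
		   &id->vendor, &id->device, &id->class_code) != 3)
		return -1;

	return 0;
}

/* Non-zero if the parsed device is an NVIDIA controller in AHCI mode. */
int is_nvidia_ahci(const struct pci_ident *id)
{
	return id->vendor == SKETCH_VENDOR_NVIDIA &&
	       id->class_code == SKETCH_CLASS_SATA_AHCI;
}
```

Feeding the line quoted above to `parse_pci_line()` yields vendor 0x10de,
device 0x0d88 and class 0x010601, so `is_nvidia_ahci()` returns non-zero.
This only identifies the device; whether it needs the MSI masking quirk is
what the test below is meant to find out.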

Yuji, could you please test the patch below on top of 5.16-rc1?

Thanks,

M.

[1] https://lore.kernel.org/r/CALjTZvbzYfBuLB+H=fj2J+9=DxjQ2Uqcy0if_PvmJ-nU-qEgkg@xxxxxxxxxxxxxx


diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 003950c738d2..cd88eddf614d 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5857,3 +5857,4 @@ static void nvidia_ion_ahci_fixup(struct pci_dev *pdev)
 	pdev->dev_flags |= PCI_DEV_FLAGS_HAS_MSI_MASKING;
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, 0x0ab8, nvidia_ion_ahci_fixup);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, 0x0d88, nvidia_ion_ahci_fixup);

--
Without deviation from the norm, progress is not possible.