Re: Regression from 7627a0edef54 ("ata: ahci: Drop low power policy board type") on reboot (but not cold boot)
From: Hans de Goede
Date: Mon Mar 10 2025 - 16:12:32 EST
Hi Niklas,
On 10-Mar-25 7:13 PM, Niklas Cassel wrote:
> Hello Hans,
>
> On Mon, Mar 10, 2025 at 10:34:13AM +0100, Hans de Goede wrote:
>>
>> I think that the port-mask register is only read-only from an OS pov,
>> the BIOS/UEFI/firmware can likely set it to e.g. exclude ports which are
>> not enabled on the motherboard (e.g. an M2 slot which can do both pci-e +
>> ata and is used in pci-e mode, so the sata port on that slot should be
>> ignored).
>>
>> What we seem to be hitting here is a bug where the UEFI can not detect
>> the SATA SSD after reboot if it ALPM was used by the OS before reboot and
>> the UEFI's SATA driver responds to the not detecting by clearing the bit
>> in the port-mask register.
>>
>> The UEFI not detecting the disk after reboot when ALPM was in use also
>> matches with not being able to boot from the disk after reboot.
>
> If we look at dmesg:
> ahci 0000:00:11.0: AHCI vers 0001.0200, 32 command slots, 6 Gbps, SATA mode
> ahci 0000:00:11.0: 3/3 ports implemented (port mask 0x38)
> ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part
>
> We can see that the controller supports slumber, partial,
> and aggressive link power management ("pm").
>
>
> A COMRESET is supposed to take the device out of partial or slumber.
>
> Now, we do not know if the BIOS code sends a COMRESET, but it definitely
> should.
>
> Anyway, it is stated in AHCI 1.3.1 "10.1 Software Initialization of HBA",
>
> "To aid system software during runtime, the BIOS shall ensure that the
> following registers are initialized to values that are reflective of the
> capabilities supported by the platform."
>
> "-PI (ports implemented)"
>
>
>>
>> I think what would be worth a try would be to disable ALPM on reboot
>> from a driver shutdown hook. IIRC the ALPM level can be changed at runtime
>> from a sysfs file, so we should be able to do the same at shutdown ?
>>
>> Its been a while since I last touched the AHCI code, so I hope someone else
>> can write a proof of concept patch with the shutdown handler disabling ALPM
>> on reboot ?
>
> I mean, that would be a quirk, and if such a quirk is created, it should
> only be applied for buggy BIOS versions.
>
> (Since BIOS is supposed to initialize the PI register properly.)
>
> If
> ahci.mobile_lpm_policy=1
> or
> ahci.mobile_lpm_policy=2
> works around your buggy BIOS, then I suggest you keep that
> until your BIOS vendor manages to release a new BIOS version.
I agree with you that this is a BIOS bug of the motherboard in question
and/or a bad interaction between the ATI SATA controller and Samsung SSD
870* models. Note that given the age of the motherboard there are likely
not going to be any BIOS updates fixing this though.
Certainly ahci.mobile_lpm_policy=x can be used to workaround this, but
going by my experience from being involved in resolving:
https://bugzilla.kernel.org/show_bug.cgi?id=201693
which took a long time to resolve and has many comments (1).
I'm afraid that we are going to see more users hit this. This seems
to be another case of samsung sata SSDs and ATI SATA chipsets not
liking each other but this time the problem is triggered by LPM rather
then by NCQ and we likely did not hit this the last time we were
seeing a lot of users reporting issues on this combo because so far
LPM has defaulted to off in these cases.
Note that in commit 7a8526a5cd51 which fixes the bug linked above,
we already disable all NCQ use for "Samsung SSD 870*" models when
used together with SATA controllers with a PCI vendor-id of ATI
because of various severe issues when it is enabled.
I strongly believe that to avoid further regressions from commit
7627a0edef54 ("ata: ahci: Drop low power policy board type") on
ATI SATA controller + Samsung SSD combinations we should probably
extend the special handling of ATI SATA chipsets to also disable LPM.
IOW add a new ATA_QUIRK_NO_LPM_ON_ATI flag which mirrors how
the current ATA_QUIRK_NO_NCQ_ON_ATI works but then for LPM
and set that for "Samsung SSD 870*".
I can prepare a patch for this if that sounds like an acceptable
solution to you.
Regards,
Hans
1) I don't know if you know, but I'm the author of the initial addition
of the "low power policy board type" list, because back then we
needed ALPM to reach high PC-states (e.g. PC10) on Broadwell and newer
Intel SoCs, while at the same time there were reports that ALPM was
causing. I also added quite a few of the initial NOLPM __ata_dev_quirks[]
entries.