Re: Fwd: Lexar NM790 SSDs are not recognized anymore after 6.1.50 LTS

From: Linux regression tracking (Thorsten Leemhuis)
Date: Tue Sep 05 2023 - 12:35:44 EST


On 05.09.23 16:35, Keith Busch wrote:
> On Tue, Sep 05, 2023 at 01:37:36PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 04.09.23 13:07, Bagas Sanjaya wrote:
>>>
>>> I notice a regression report on Bugzilla [1]. Quoting from it:
>>>
>>>> I bought a new 4 TB Lexar NM790 and I was using kernel 6.3.13 at the time. It wasn't recognized, with these messages in dmesg:
>>>>
>>>> [ 358.950147] nvme nvme0: pci function 0000:06:00.0
>>>> [ 358.958327] nvme nvme0: Device not ready; aborting initialisation, CSTS=0x0
>>>>
>>>> My other NVMe appears correctly in the nvme list though.
>>>>
>>>>
>>>> So I tried using other kernels I had installed at the time: 6.3.7, 6.4.10, 6.5.0rc6, 6.5.0, 6.5.1 and none of these recognized the disk.
>>>> I installed the 6.1.50 lts kernel from arch repositories (I can compile my own too if this would be an issue) and then the device was correctly recognized:
>>>>
>>>> [ 4.654613] nvme 0000:06:00.0: platform quirk: setting simple suspend
>>>> [ 4.654632] nvme nvme0: pci function 0000:06:00.0
>>>> [ 4.667290] nvme nvme0: allocated 40 MiB host memory buffer.
>>>> [ 4.709473] nvme nvme0: 16/0/0 default/read/poll queues
>>
>> FWIW, the quoted mail missed one crucial detail:
>> """
>> Claudio Sampaio 2023-09-02 19:04:29 UTC
>>
>> Adding the two lines
>>
>>     { PCI_DEVICE(0x1d97, 0x1602), /* Lexar NM790 */
>>       .driver_data = NVME_QUIRK_BOGUS_NID, },
>>
>> in file drivers/nvme/host/pci.c made my NVMe work correctly. Compiled a
>> new 6.5.1 kernel and everything works.
>> """
>>
>> @NVME maintainers: is there anything more you need from Claudio at this
>> point?
>
> Yes: it doesn't really make any sense. The report says the device
> stopped showing up with message:
>
> nvme nvme0: Device not ready; aborting initialisation, CSTS=0x0
>
> That (a) happens long before the mentioned quirk is considered by the
> driver, and (b) the "quirk" behavior is now the default in 6.5 and
> several of the listed stable kernels anyway.
>
> It more likely sounds like the device is flaky and either never becomes
> ready due to some unspecified internal firmware condition, or
> inaccurately reports how long it actually needs to become ready in
> worst-case-scenario.

Thx, I kinda suspected something like that, but I kept my mouth shut, as
I feared comments from the cheap seats might be more harmful than helpful.

But what can Claudio do to find the root cause? Check hardware
(especially the connectors), update firmware, ...? And if that doesn't
lead to anything, bisect the issue?
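
FWIW, to make Keith's point a bit more concrete, here is a minimal
sketch of how the CSTS value from the log decodes and how the "time
needed to become ready" is advertised in the CAP register. It assumes
the bit layout from the NVMe base specification as I remember it (CSTS
bit 0 = RDY, bit 1 = CFS, bits 3:2 = SHST; CAP bits 31:24 = TO in units
of 500 ms). The helper names and the example CAP value are made up for
illustration and are not taken from the report or the driver.

```python
# Sketch only: decode two NVMe controller registers by hand.
# Helper names are hypothetical; bit positions assume the NVMe base spec layout.

def decode_csts(csts: int) -> dict:
    """Split a raw CSTS (Controller Status) value into a few of its fields."""
    return {
        "rdy": bool(csts & 0x1),    # bit 0: controller is ready
        "cfs": bool(csts & 0x2),    # bit 1: controller fatal status
        "shst": (csts >> 2) & 0x3,  # bits 3:2: shutdown status
    }


def cap_timeout_ms(cap: int) -> int:
    """Return CAP.TO (bits 31:24, 500 ms units) converted to milliseconds."""
    return ((cap >> 24) & 0xFF) * 500


# CSTS=0x0 as in the failing log: not ready, but no fatal status either.
print(decode_csts(0x0))

# Hypothetical CAP value with TO = 0x20; NOT read from Claudio's device.
example_cap = 0x20 << 24
print(cap_timeout_ms(example_cap))
```

So CSTS=0x0 just says the controller never raised RDY within the time
the driver was willing to wait. The two lines in the failing log are
only about 8 ms apart, so checking what timeout the device advertises
might be worthwhile, but that is a guess on my side.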

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.