Re: Fwd: Need NVME QUIRK BOGUS for SAMSUNG MZ1WV480HCGL-000MV (Samsung SM-953 Datacenter SSD)
From: Linux regression tracking (Thorsten Leemhuis)
Date: Tue Jul 11 2023 - 05:39:52 EST
[CCing Linus for the "whack a mole" aspect in the second half]
On 11.07.23 08:54, Pankaj Raghav wrote:
>>> I understand that, but I think we need middlemen for that, as I or Bagas
>>> don't have the contacts -- and it's IMHO also a bit much to ask us for
>>> in general, as regression tracking is hard enough already. At least
>>> unless this becomes something that happens regularly, then a list of
>>> persons we could contact would be fine I guess. But we simply can't deal
>>> with too many subsystem specific special cases.
>>
>> I'm not asking the Linux regression trackers to fill that role, though.
Well, during our work we often encounter those bugs -- often reported by
people who are not regular developers and who already had a hard time
understanding the issue and reporting it to us somehow. Asking them to...
>> I'm asking people who experience these issues to report it to their vendor
...find the right destination and format to report their Linux problems
to the vendors is unlikely to fly I suspect. And I'm not sure if that is
in our interest, as then it might take a lot longer to get those quirk
entries into the kernel source.
But whatever, the main reason why I write this mail is different:
>> directly because these device makers apparently have zero clue that
>> their spec non-compliance is causing painful experiences for their
>> customers and annoyance for maintainers. They keep pumping out more and
>> more devices with the same breakage.
>>
>> This particular vendor has been great at engaging with Linux, but that's
>> not necessarily normal among all device makers, and I don't have
>> contacts with the majority of the vendors we've had to quirk for this
>> issue.
>>
>> We did complain to the NVMe spec workgroup that their compliance cert
>> suite is not testing for this. There was a little initial interest in
>> fixing that gap, but it fizzled out...
Preface: this is not my area of expertise, and maybe I should keep my
mouth shut. But whatever.
Well, that "They keep pumping out more and more devices with the same
breakage" and the "new device" comment from Pankaj below raise the
question: should we stop trying to play "whack a mole" with all those
quirk entries and handle devices with duplicate ids just like Windows does?
That would "make things just work"(tm).
And yes, I suspect there are good reasons why we went down the "quirk"
route or why abandoning it might be hard. But maybe it's time to
reconsider that path, as from my outside point of view things sound a
lot like they are somewhat similar to the ACPI problems we dealt with
~15 years ago: we learned that we have to deal with broken ACPI
implementations and somehow use them in a way similar to how Windows
uses them, as that's the OS the machine was designed for and tested with.
Ciao, Thorsten
>>> Another request came in today, even with a pseudo-patch:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=217649
>>>
>>> To quote:
>>> ```
>>> As with numerous NVMe controllers these days, Samsung's
>>> MZAL41T0HBLB-00BL2, which Lenovo builds into their 16ARP8 also suffers
>>> from invalid IDs, breaking suspend and hibernate also on the latest
>>> kernel 6.4.2.
> [...]
>> Pankaj, okay with this one too?
>
> This looks like a new device that might have a firmware update. I will ping
> internally first.
>
> As you mentioned, the recent addition of the globally unique ID check
> is breaking a lot of devices because of non-compliant firmware. I will try to create
> some awareness about this issue internally as well.