Re: [Bug 219609] File corruptions on SSD in 1st M.2 socket of AsRock X600M-STX + Ryzen 8700G
From: Bruno Gravato
Date: Fri Jan 10 2025 - 06:18:09 EST
Hi,
(resending in text-only mode, because mailing lists don't like HTML
emails... sorry to those getting this twice)
I can reply via email, that's not a problem.
I'll try to run some more tests when I get the chance (it's been a
very busy week, sorry).
Besides the volatile write cache test, any other test I should try?
Regarding the M.2 slots: I believe this motherboard has no chipset, so
both slots should be connected directly to the CPU (mine is Ryzen
8600G), although they might be connected to different parts of the
CPU, right? I guess that can make a difference.
My disks are gen4 as well.
Bruno
On Thu, 9 Jan 2025 at 15:44, Stefan <linux-kernel@xxxxxxx> wrote:
>
> Hi,
>
> due to Thorsten's hints, I'm trying to reply to both the bug tracker and
> the mailing list.
>
> > --- Comment #13 from Keith Busch (kbusch@xxxxxxxxxx) ---
> > If I'm summarizing correctly, we're seeing corruption on Lexar, Kingston,
> > and now Samsung NVMe's?
>
> The Kingston read errors may be something different. They are described
> in detail in messages #108 and #113 of the Debian Bug Tracker
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1076372
>
> With the Kingston, I never saw the write errors that occur with Lexar and
> Samsung on newer kernels (and which are easy to reproduce).
>
> (ATM I cannot provide test results from the Kingston SSD because the
> Lexar is installed, the PC is installed remotely and in use. Thus I
> can't swap the SSDs that often.)
>
> > # cat /sys/block/nvme0n1/queue/fua
>
> Returns "1"
>
> > --- Comment #15 from Keith Busch (kbusch@xxxxxxxxxx) --- as a test,
> > could you turn off the volatile write cache?
> >
> > # sudo nvme set-feature /dev/nvme0n1 -f 6 -v 0
> Had to modify that a little bit:
>
> $ nvme get-feature /dev/nvme0n1 -f 6
> get-feature:0x06 (Volatile Write Cache), Current value:0x00000001
> $ nvme set-feature /dev/nvme0 -f 6 /dev/nvme0n1 -v 0
> set-feature:0x06 (Volatile Write Cache), value:00000000,
> cdw12:00000000, save:0
> $ nvme get-feature /dev/nvme0n1 -f 6
> get-feature:0x06 (Volatile Write Cache), Current value:00000000
>
> Corruptions disappear (under 6.13.0-rc6) if volatile write cache is
> disabled (and appear again if I turn it on with "-v 1").
>
> But, lspci says I have a
>
> Shenzhen Longsys Electronics Co., Ltd. Lexar NM790 NVME SSD
> (DRAM-less) (rev 01) (prog-if 02 [NVM Express])
>
> Note the "DRAM-less". This is confirmed by
> https://www.techpowerup.com/ssd-specs/lexar-nm790-4-tb.d1591. Instead,
> the SSD has a (*non-*volatile) SLC write cache and it uses a 40 MB
> Host Memory Buffer (HMB).
>
> Could there be an issue with the HMB allocation/usage?
>
> Is the mainboard firmware involved in HMB allocation/usage? That
> would explain why volatile write caching via HMB works in the 2nd M.2
> socket.
>
> BTW, the controller is a MaxioTech MAP1602A, which is different from
> the Samsung controllers.
>
> > --- Comment #14 from Bruno Gravato (bgravato@xxxxxxxxx) --- The only
> > difference in the specs between the two M.2 slots is that one is
> > gen5x4 (the main one, which is the one with problems) and the other
> > is gen4x4 (this works fine, no errors).
>
> AFAIK this primary M.2 socket is connected to dedicated PCIe lanes of
> the CPU. On my PC, it runs in Gen4 mode (limited by SSD).
>
> The secondary M.2 socket on the rear side is probably connected to PCIe
> lanes which are usually used by a chipset -- but that socket works.
>
> Regards Stefan
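
For anyone repeating the volatile write cache test quoted above, here is
a minimal sketch of a helper that reads the output of the get-feature
command from the thread and reports whether the cache is on. The name
vwc_state is hypothetical (it is not part of nvme-cli), and the parsing
assumes the "Current value:" output format shown in the transcript, with
or without the 0x prefix. It only reads text; it does not change any
drive setting.

```shell
#!/bin/sh
# vwc_state: hypothetical helper, not part of nvme-cli.
# Reads `nvme get-feature <dev> -f 6` output on stdin and prints
# "enabled", "disabled", or "unknown" (assumes a "Current value:<hex>" field).
vwc_state() {
    # Extract the hex digits after "Current value:", ignoring an optional 0x.
    value=$(sed -n 's/.*Current value:\(0x\)\{0,1\}\([0-9A-Fa-f]\{1,\}\).*/\2/p' | head -n 1)
    [ -n "$value" ] || { echo unknown; return 1; }
    # Bit 0 of the feature value is the write-cache-enable flag.
    if [ $(( 0x$value & 1 )) -eq 1 ]; then
        echo enabled
    else
        echo disabled
    fi
}

# Example use on a real system (device path taken from the thread):
#   nvme get-feature /dev/nvme0n1 -f 6 | vwc_state
```

Running it before and after each corruption test makes it easy to log
which cache state a given result belongs to.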