Re: [Bug 219609] File corruptions on SSD in 1st M.2 socket of AsRock X600M-STX + Ryzen 8700G

From: Stefan
Date: Thu Jan 09 2025 - 10:44:22 EST


Hi,

following Thorsten's hint, I'm trying to reply to both the bug tracker
and the mailing list.

--- Comment #13 from Keith Busch (kbusch@xxxxxxxxxx) ---
If I'm summarizing correctly, we're seeing corruption on Lexar, Kingston,
and now Samsung NVMe's?

The Kingston read errors may be something different. They are described
in detail in messages #108 and #113 of the Debian Bug Tracker
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1076372

With the Kingston, I never saw the write errors that occur with the
Lexar and Samsung SSDs on newer kernels (and which are easy to
reproduce).

(ATM I cannot provide test results from the Kingston SSD because the
Lexar is currently installed, and the PC is at a remote location and in
use. Thus I can't swap the SSDs that often.)

# cat /sys/block/nvme0n1/queue/fua

Returns "1"
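
For reference, a small sketch that prints the FUA flag next to the
block layer's write cache mode for the same device. The helper name
show_cache_attrs and its optional second argument (an alternative sysfs
root, only there so the function can be tried against a scratch
directory) are illustrative assumptions; fua and write_cache are the
standard queue attributes.

```shell
#!/bin/sh
# Print the fua and write_cache queue attributes of a block device.
#   $1: block device name, e.g. nvme0n1
#   $2: sysfs root, defaults to /sys (hypothetical parameter for testing)
# Usage: show_cache_attrs nvme0n1
show_cache_attrs() {
    dev=$1
    root=${2:-/sys}
    for attr in fua write_cache; do
        f="$root/block/$dev/queue/$attr"
        if [ -r "$f" ]; then
            # write_cache reads "write back" or "write through"
            printf '%s=%s\n' "$attr" "$(cat "$f")"
        else
            printf '%s=unavailable\n' "$attr"
        fi
    done
}
```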

--- Comment #15 from Keith Busch (kbusch@xxxxxxxxxx) --- as a test,
could you turn off the volatile write cache?

# sudo nvme set-feature /dev/nvme0n1 -f 6 -v 0
Had to modify that a little bit:

$ nvme get-feature /dev/nvme0n1 -f 6
get-feature:0x06 (Volatile Write Cache), Current value:0x00000001
$ nvme set-feature /dev/nvme0 -f 6 /dev/nvme0n1 -v 0
set-feature:0x06 (Volatile Write Cache), value:00000000,
cdw12:00000000, save:0
$ nvme get-feature /dev/nvme0n1 -f 6
get-feature:0x06 (Volatile Write Cache), Current value:00000000

Corruptions disappear (under 6.13.0-rc6) if the volatile write cache is
disabled (and appear again if I turn it on with "-v 1").
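
A minimal sketch of a write/read-back check that could be run once with
the cache on and once with it off. This is not the exact procedure used
in the bug reports; the function name check_writeback, the file name
and the default size are assumptions. Dropping the page cache needs
root; without it the second read may be served from RAM and prove
nothing.

```shell
#!/bin/sh
# Write random data, then compare its checksum before and after the
# page cache has been dropped.
#   $1: directory on the filesystem under test
#   $2: size in MiB (default 64)
# Returns 0 if both checksums match, 1 on mismatch, 2 on setup failure.
check_writeback() {
    dir=$1
    mib=${2:-64}
    f="$dir/corruption-test.bin"
    dd if=/dev/urandom of="$f" bs=1M count="$mib" 2>/dev/null || return 2
    before=$(sha256sum < "$f")   # normally still served from the page cache
    sync                         # push dirty pages out to the SSD
    # Best effort: force the next read to come from the device (root only).
    { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null || true
    after=$(sha256sum < "$f")
    rm -f "$f"
    [ "$before" = "$after" ]
}
```

Usage would be something like "check_writeback /mnt/test 4096 && echo ok",
with /mnt/test being a placeholder for a directory on the affected SSD.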

But, lspci says I have a

Shenzhen Longsys Electronics Co., Ltd. Lexar NM790 NVME SSD
(DRAM-less) (rev 01) (prog-if 02 [NVM Express])

Note the "DRAM-less". This is confirmed by
https://www.techpowerup.com/ssd-specs/lexar-nm790-4-tb.d1591. Instead of
DRAM, the SSD has a (*non-*volatile) SLC write cache and it uses a 40 MB
Host Memory Buffer (HMB).

Could there be an issue with the HMB allocation/usage?

Is the mainboard firmware involved in HMB allocation/usage? That would
explain why volatile write caching via HMB works in the 2nd M.2
socket.
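
To see what the controller asks for, the preferred HMB size can be read
from the controller identify data. A minimal sketch, assuming nvme-cli
prints an "hmpre" line in decimal 4 KiB units; the helper name
hmb_pref_mib is hypothetical, and the value 10240 in the example is
made up to match 40 MiB, not taken from this SSD.

```shell
#!/bin/sh
# Read `nvme id-ctrl` text on stdin and print the preferred host memory
# buffer size (hmpre, counted in 4 KiB units) in MiB.
hmb_pref_mib() {
    awk -F: '/^hmpre/ { gsub(/[ \t]/, "", $2); print int($2 * 4 / 1024) }'
}

# Example with a made-up identify line:
#   printf 'hmpre     : 10240\n' | hmb_pref_mib
# On the real system (needs root):
#   nvme id-ctrl /dev/nvme0 | hmb_pref_mib
#   nvme get-feature /dev/nvme0 -f 0x0d     # Host Memory Buffer feature
#   dmesg | grep -i 'host memory buffer'    # what the driver allocated
```

As far as I know, the nvme driver also has a max_host_mem_size_mb
module parameter that limits the HMB size, which might be a way to test
with the HMB disabled; I have not tried that here.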

BTW, the controller is a MaxioTech MAP1602A, which is different from
the Samsung controllers.

--- Comment #14 from Bruno Gravato (bgravato@xxxxxxxxx) --- The only
difference in the specs between the two M.2 slots is that one is
gen5x4 (the main one, which is the one with problems) and the other
is gen4x4 (this works fine, no errors).

AFAIK this primary M.2 socket is connected to dedicated PCIe lanes of
the CPU. On my PC, it runs in Gen4 mode (limited by SSD).

The secondary M.2 socket on the rear side is probably connected to PCIe
lanes which are usually used by a chipset -- but that socket works.

Regards Stefan