Re: [Bug 219609] File corruptions on SSD in 1st M.2 socket of AsRock X600M-STX + Ryzen 8700G

From: Stefan
Date: Thu Feb 06 2025 - 10:58:58 EST


Hi,

after Matthias was kind enough (kinder than me) to make a video (!) for
ASRock support, and after I once again referred to this thread and the
many users who have the same problem, ASRock is now able to reproduce
the issues.

Ralph, all tests in comment #40 (including the network issue) were run
twice, because I did not collect logs and lspci outputs the first time.
(The corruptions seem to depend on which PCIe devices / lanes (?) are
used. That's why I also included the lspci outputs.)

(As announced in the initial message, I cannot run tests ATM and for a while.)

Regards Stefan


On 03.02.25 at 19:48, Stefan wrote:
Hi,

just got feedback from ASRock. They asked me to make a video of the
corruptions occurring on my remotely (and headless) running system.
Maybe I should make a video of printing out the logs that can be found
on the Linux and Debian bug trackers ...

Seems that ASRock is unwilling to solve the problem.

Regards Stefan


On 28.01.25 at 15:24, Stefan wrote:
Hi,

On 28.01.25 at 13:52, Dr. David Alan Gilbert wrote:
Is there any characterisation of the corrupted data; last time I
looked at the bz there wasn't.

Yes, there is. (And I already reported it at least on the Debian bug
tracker, see links in the initial message.)

f3 reports overwritten sectors, i.e. it looks like the pseudo-random
test pattern is written to the wrong position. These corruptions occur
in clusters whose size is an integer multiple of 2^17 bytes in most
cases (about 80%) and of 2^15 bytes in all cases.

The frequency of these corruptions is roughly 1 cluster per 50 GB
written.
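
For anyone who wants to compare their own results against this
characterisation, here is a minimal sketch in Python. It assumes you
already have a list of corrupted sector numbers (f3 itself only prints
per-file counts, so the list is a hypothetical input you would have to
produce yourself) and that a sector is 512 bytes. The function names
are made up for illustration.

```python
SECTOR = 512  # bytes per sector; an assumption, adjust if your tool differs


def group_clusters(bad_sectors):
    """Group sector numbers into runs of consecutive sectors.

    Returns a list of (start_byte, length_bytes) tuples, one per cluster.
    """
    clusters = []
    run_start = prev = None
    for s in sorted(set(bad_sectors)):
        if prev is not None and s == prev + 1:
            prev = s  # still inside the current run
            continue
        if run_start is not None:
            clusters.append((run_start * SECTOR,
                             (prev - run_start + 1) * SECTOR))
        run_start = prev = s  # start a new run
    if run_start is not None:
        clusters.append((run_start * SECTOR,
                         (prev - run_start + 1) * SECTOR))
    return clusters


def summarize(clusters, bytes_written):
    """Count clusters whose size is a multiple of 2^17 or 2^15 bytes."""
    n = len(clusters)
    return {
        "clusters": n,
        "multiple_of_2^17": sum(1 for _, l in clusters if l % 2**17 == 0),
        "multiple_of_2^15": sum(1 for _, l in clusters if l % 2**15 == 0),
        "gb_per_cluster": (bytes_written / 1e9 / n) if n else None,
    }


if __name__ == "__main__":
    # Made-up example: one 128 KiB cluster and one 32 KiB cluster.
    bad = list(range(256, 512)) + list(range(1024, 1088))
    print(summarize(group_clusters(bad), bytes_written=100e9))
```

If the pattern described above holds on your system, the 2^15 count
should equal the total number of clusters, and the 2^17 count should be
roughly 80% of it.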

Can others confirm this or do they observe a different characteristic?

Regards Stefan


I mean, is it reliably any of:
    a) What's the size of the corruption?
           block, cache line, word, bit???
    b) Position?
           e.g. last word in a block or something?
    c) Data?
           pile of zeros/ff's, junk, etc.?

    d) Is it a missed write, old data, or partially written block?

Dave

Phew.  I'm kinda lost on what we could do about this on the Linux
side.

Because it also depends on the CPU series, a firmware or hardware issue
seems to be more likely than a Linux bug.

ATM ASRock is still trying to reproduce the issue. (I'm in contact with
them too. But they have Chinese New Year holidays in Taiwan this week.)

If they can't reproduce it, they have to provide an explanation why the
issues are seen by so many users.

Regards Stefan