Re: [Bug 219609] File corruptions on SSD in 1st M.2 socket of AsRock X600M-STX + Ryzen 8700G
From: Bruno Gravato
Date: Tue Feb 04 2025 - 04:13:19 EST

On Tue, 4 Feb 2025 at 06:12, Christoph Hellwig wrote:
>
> On Sun, Feb 02, 2025 at 08:32:31AM +0000, Bruno Gravato wrote:
> > In my tests I was using real data: a backup of my files.
> >
> > On one such test I copied over 300K files, of various sizes and types,
> > totalling about 60GB. A bit over 20 files got corrupted.
> > I tried copying the files over the network (ethernet) using rsync/ssh.
> > I also tried restoring the files using restic (over ssh as well). And
> > I also tried copying the files locally from a SATA disk. In all cases
> > I got similar results with some files being corrupted.
> > The destination nvme disk was using btrfs and running btrfs scrub
> > after the copy detects quite a few checksum errors.
>
> So you used various different data sources, and the destination was
> always the nvme device in the suspect slot.
>
Yes, regardless of the data source, the destination was always a
single nvme disk on the main M.2 nvme slot, with the secondary M.2
nvme slot empty.
I tried 3 different disks (WD, Crucial and Solidigm) with similar results.
If I put any of those disks on the secondary M.2 slot (with the main
slot empty) the problem doesn't occur.
What intrigues me most is that if I put 2 nvme disks in, occupying
both M.2 slots, the problem doesn't occur either.
The secondary slot must be empty for the issue to happen.
I didn't try using the main M.2 slot as source instead of target, to
see if the problem occurs on reads as well.
I could try that if you think it's worth testing.
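
For anyone wanting to reproduce this without relying on btrfs scrub, a
minimal sketch of a source-vs-destination comparison follows. It is not
the exact procedure used in the tests above; the function names
(file_sha256, find_mismatches) and the two root paths are hypothetical,
and it assumes both trees are mounted locally.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_mismatches(src_root, dst_root):
    """List relative paths whose copy under dst_root is missing or differs.

    Walks src_root (the known-good data) and compares each regular file
    against the file at the same relative path under dst_root (the copy
    on the suspect nvme disk).
    """
    mismatches = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.isfile(dst) or file_sha256(src) != file_sha256(dst):
                mismatches.append(rel)
    return sorted(mismatches)
```

Note that a comparison run right after the copy may be served from the
page cache rather than from the disk, so the destination should be
remounted (or caches dropped) first for the result to reflect what was
actually written.
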
> > I analyzed some of those corrupted files and one of them happened to
> > be a text file (linux kernel source code).
> > A big portion of the text was replaced with text from another file in
> > the same directory (being text made it easy to find where it came
> > from).
> > So this was a contiguous block of text that was overwritten with a
> > contiguous block of text from another file.
> > If I remember correctly the other file was not corrupted (so the
> > blocks weren't swapped). It looked like a certain block of text was
> > written twice: on the correct file and on another file in the same
> > directory.
>
> That's a very interesting pattern.
>
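
In case it helps to characterize the pattern more precisely (offsets and
sizes of the overwritten ranges, and which file the foreign data came
from), here is a rough sketch of how that could be scripted. The 4096
byte block size and the function names are assumptions for illustration
only; the actual granularity of the corruption is not known.

```python
def differing_ranges(good, bad, block_size=4096):
    """Return (offset, length) ranges where two byte strings differ.

    The comparison is done at block_size granularity (4096 is an assumed
    value) and adjacent differing blocks are merged into one range.
    """
    ranges = []
    n = max(len(good), len(bad))
    start = None
    for off in range(0, n, block_size):
        same = good[off:off + block_size] == bad[off:off + block_size]
        if not same and start is None:
            start = off
        elif same and start is not None:
            ranges.append((start, off - start))
            start = None
    if start is not None:
        ranges.append((start, n - start))
    return ranges


def find_source_of(block, candidates):
    """Find where a block of foreign data occurs in other files.

    candidates maps a file name to its full contents (bytes). Returns a
    list of (name, offset) for the first occurrence in each file.
    """
    hits = []
    for name, data in candidates.items():
        pos = data.find(block)
        if pos != -1:
            hits.append((name, pos))
    return hits
```

Feeding the bytes of each differing range from the corrupted copy into
find_source_of(), with the other files of the same directory as
candidates, would show whether the foreign data is block-aligned in its
file of origin as well.
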
> > I also got some jpeg images corrupted. I was able to open and view
> > (partially) those images and it looked like a portion of the image was
> > repeated in a different part of it, so blocks of the same file were
> > probably duplicated and overwritten within itself.
> >
> > The blocks being overwritten seemed to be different sizes on different files.
>
> This does sound like a fairly common pattern due to SSD FTL issues,
> but I still don't want to rule out swiotlb, which due to the bucketing
> could maybe also lead to these, but I can't really see how. But the
> fact that the affected systems seem to be using swiotlb despite no
> good reason for them to do so still leaves me puzzled.
>