Re: [Bug 219609] File corruptions on SSD in 1st M.2 socket of AsRock X600M-STX + Ryzen 8700G
From: Christoph Hellwig
Date: Tue Feb 04 2025 - 01:14:28 EST
On Sun, Feb 02, 2025 at 08:32:31AM +0000, Bruno Gravato wrote:
> In my tests I was using real data: a backup of my files.
>
> On one such test I copied over 300K files, variable sizes and types
> totalling about 60GB. A bit over 20 files got corrupted.
> I tried copying the files over the network (ethernet) using rsync/ssh.
> I also tried restoring the files using restic (over ssh as well). And
> I also tried copying the files locally from a SATA disk. In all cases
> I got similar results with some files being corrupted.
> The destination nvme disk was using btrfs and running btrfs scrub
> after the copy detects quite a few checksum errors.
So you used various different data sources, and the destination was
always the nvme device in the suspect slot.
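
For anyone wanting to repeat this kind of test without depending on
btrfs scrub, a read-back comparison against the source can be sketched
as below. This is only an illustration: file_digest() and
find_mismatches() are hypothetical helper names, not part of any tool
mentioned in this thread. Drop the page cache (or remount) before
reading the destination back, otherwise the reads may be served from
memory instead of the device.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_mismatches(src_root, dst_root):
    """Return sorted relative paths whose copy is missing or differs."""
    bad = []
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            # A missing destination file counts as a mismatch too.
            if not os.path.isfile(dst) or file_digest(src) != file_digest(dst):
                bad.append(rel)
    return sorted(bad)
```

Running find_mismatches() with the backup as src_root and the nvme
mount point as dst_root would give the list of affected files to look
at more closely.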
> I analyzed some of those corrupted files and one of them happened to
> be a text file (linux kernel source code).
> A big portion of the text was replaced with text from another file in
> the same directory (being text made it easy to find where it came
> from).
> So this was a contiguous block of text that was overwritten with a
> contiguous block of text from another file.
> If I remember correctly the other file was not corrupted (so the
> blocks weren't swapped). It looked like a certain block of text was
> written twice: on the correct file and on another file in the same
> directory.
That's a very interesting pattern.
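
To pin the pattern down it would help to know the offset and length of
each bad range and where the foreign data came from. A rough sketch,
assuming a known-good copy of the file is available and that comparing
at a 4096-byte granularity is good enough (both are assumptions, as are
the helper names diff_runs() and find_origin()):

```python
def diff_runs(good, bad, block=4096):
    """Return (offset, length) runs of consecutive blocks that differ."""
    runs = []
    size = max(len(good), len(bad))
    start = None
    for off in range(0, size, block):
        if good[off:off + block] != bad[off:off + block]:
            if start is None:
                start = off  # first differing block of a new run
        elif start is not None:
            runs.append((start, off - start))
            start = None
    if start is not None:
        # The run extends to the end of the longer buffer.
        runs.append((start, size - start))
    return runs


def find_origin(foreign, candidates):
    """Return (name, offset) for each candidate containing the foreign bytes."""
    hits = []
    for name, data in candidates.items():
        pos = data.find(foreign)
        if pos != -1:
            hits.append((name, pos))
    return hits
```

Feeding the bytes of each bad run to find_origin(), with the other
files from the same directory as candidates, would show whether the
foreign data sits at the same file offset in its origin and whether the
run lengths share a common alignment.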
> I also got some jpeg images corrupted. I was able to open and view
> (partially) those images and it looked like a portion of the image was
> repeated in a different part of it, so blocks of the same file were
> probably duplicated and overwritten within the file itself.
>
> The blocks being overwritten seemed to be different sizes on different files.
This does sound like a fairly common pattern due to SSD FTL issues,
but I still don't want to rule out swiotlb. Its bucketing could maybe
also lead to corruption like this, although I can't really see how.
What still leaves me puzzled is that the affected systems seem to be
using swiotlb despite having no good reason to do so.
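
One way to confirm whether bounce buffering is actually active during
the copy is to sample the swiotlb counters in debugfs. The sketch below
assumes the kernel exposes io_tlb_nslabs and io_tlb_used under
/sys/kernel/debug/swiotlb, which depends on kernel version and
configuration and needs debugfs mounted and root access;
read_swiotlb_usage() is a hypothetical helper name.

```python
import os


def read_swiotlb_usage(debug_dir="/sys/kernel/debug/swiotlb"):
    """Return {'nslabs': int, 'used': int}, or None if the counters are absent."""
    values = {}
    for key, fname in (("nslabs", "io_tlb_nslabs"), ("used", "io_tlb_used")):
        path = os.path.join(debug_dir, fname)
        try:
            with open(path) as f:
                values[key] = int(f.read().strip())
        except (OSError, ValueError):
            # Missing file, no permission, or unexpected content.
            return None
    return values
```

Sampling this before and during a large copy to the suspect slot, and
seeing a non-zero 'used' value only while the copy runs, would suggest
that the nvme I/O is going through the bounce buffers.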