I'm seeing silent write errors with two Seagate 160GB drives on a sil3112A SATA controller on an ASUS A7N8X-Deluxe motherboard, but no read errors. Each drive is on its own cable. The kernel is the 2.4.23 release with the 2.4.23-libata2 patch and some patches for MythTV applied. (I also see the same problem under 2.4.24 with only the libata patch.) There are no error messages in any log files or in the kernel dmesg output. The error rate looks to be around 1 in a million blocks. My current guess is that the affected block or blocks simply never got written by the drive, since reading back the blocks with the bad data produces no read error messages.
I am now running the tests again using jfs as the filesystem rather than ext3, to rule out the filesystem as the cause. As part of this test I'm also zeroing the disks beforehand, then checking for non-zero data and how large the non-zero data blocks are. Given enough time, I may run this test a couple of times to see whether the positions of the bad data move about or stay put.
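For anyone who wants to repeat the zero-fill check, here is a minimal sketch in Python of scanning a file or device for runs of non-zero blocks. It is not the tool used for the tests above. The 4096-byte block size and the name find_nonzero_runs are assumptions made for illustration.

```python
BLOCK_SIZE = 4096  # assumed block size; the real test's block size is not stated


def find_nonzero_runs(path, block_size=BLOCK_SIZE):
    """Return (first_block, block_count) for each run of non-zero blocks."""
    zero = bytes(block_size)
    runs = []
    run_start = None  # block index where the current non-zero run began
    index = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            # Compare against a zero buffer of the same length, so a short
            # final block is handled correctly.
            is_zero = block == zero[:len(block)]
            if not is_zero and run_start is None:
                run_start = index
            elif is_zero and run_start is not None:
                runs.append((run_start, index - run_start))
                run_start = None
            index += 1
    if run_start is not None:
        # The file ended in the middle of a non-zero run.
        runs.append((run_start, index - run_start))
    return runs
```

Saving the result from each pass and comparing the lists would show whether the bad positions move about or stay put.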
How the first tests were run:
I used "cp -a * /data10" as root to copy the data (150GB worth) to the disk under test. Then I created md5sum lists for all files in both the source and the destination, sorted them, and compared the lists. There were differences between the lists: some files and directories were missing or corrupted. On fscking the disks, some inodes were found to be corrupted. I then re-copied only the corrupted files, and after a few iterations I finally got a copy that was the same as the source. I then ran a program to repeatedly md5sum all the files and check them against the source. It did not see any differences in 10 cycles.
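The md5sum comparison described above can be sketched roughly as follows in Python. This is an illustration only, not the script that was actually run. The function names (md5_of_file, md5_manifest, compare_trees) are hypothetical, and skipping symlinks is an assumption.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_manifest(root):
    """Map each regular file's path (relative to root) to its MD5 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Assumption: symlinks and special files are skipped.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest


def compare_trees(source, dest):
    """Return (missing, extra, corrupted) relative paths, each list sorted."""
    src = md5_manifest(source)
    dst = md5_manifest(dest)
    missing = sorted(set(src) - set(dst))    # in source, absent from copy
    extra = sorted(set(dst) - set(src))      # in copy, absent from source
    corrupted = sorted(p for p in src.keys() & dst.keys() if src[p] != dst[p])
    return missing, extra, corrupted
```

Calling compare_trees on the source directory and /data10 in a loop would give the repeated verification pass. Note that a re-read may be served from the page cache rather than from the disk, so a clean result does not by itself prove that the on-disk data is good.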
I've attached my .config file, but don't have dmesg output to include. I will grab that when I reboot after the current tests are finished.
- Bryan