Re: Silent data corruption with kernel 3.4 and FireWire disks

From: Jonathan Woithe
Date: Mon Jun 04 2012 - 20:30:44 EST


On Mon, Jun 04, 2012 at 07:28:50PM +0000, Stefan Richter wrote:
> About a week ago I noticed silent data corruptions of files on FireWire
> disks: Mount disk, read lots of data and e.g. compute their md5sum,
> unmount disk, mount disk again, read and md5sum the same files again ->
> MD5s may differ.
>
> Defects in files that were written in May hint that not only reading from
> but also writing to FireWire disks resulted in corrupt data. This was
> silent corruption without any error messages from the PCI, firewire, SCSI,
> block, or filesystem subsystems.
>
> Affected:
> - kernel 3.4
> - kernel 3.4-rc5
> Not affected:
> - kernel 3.3.1 (which I have been running now for the last 6 days)

Hmm, funny you should mention this. Over the past few months I have also
been experiencing silent corruption of a firewire disc, although I suspect
it may be for a different reason. The corruptions started occurring soon
after I upgraded a machine to kernel 2.6.39 in May 2011. The filesystem was
xfs, and when corruption occurred it generally took out the entire
filesystem (on repair, everything would be bundled unsorted into
lost+found).

The disc is written to once a day using rsync.

I removed the drive from its enclosure and ran various SMART tests on it
directly (the enclosure prevents SMART from operating). The drive showed no
pre-fail signs, passed all self-tests and didn't show any problems under
badblocks tests (read-write or destructive write).

On 18 May this year I upgraded the kernel to 3.3.6 and thus far I have not
had a repeat of the corruption. Under 2.6.39 I was usually seeing a
corruption event well within 2 weeks of recreating the filesystem, although
sometimes it took longer. Although it's early days, 3.3.6 so far seems to
be behaving better than 2.6.39.

Combined with Stefan's observations, this would indicate that there were
issues with 2.6.39, that they weren't present in 3.3.x, and that they then
reappeared in 3.4. It's the disappearance and reappearance which has me
thinking that perhaps we are seeing two different problems, one of which
has been fixed.

> FireWire disks with different 1394-to-SATA or -IDE bridge chips are
> affected. I noticed the problem at first on an Agere FW643e PCIe 1394
> controller which sits behind a PLX PEX 8505 PCIe switch.

In my case the enclosure was one based on the Oxford Semiconductor chipset
(911?). The drive is a PATA Western Digital 500 GB drive (00AAKB-00H8A0 - I
think from memory it's a Green drive). The firewire card is reported to be

VIA Technologies, Inc. IEEE 1394 Host Controller (rev 46)
Subsystem: VIA Technologies, Inc. IEEE 1394 Host Controller

(vendor/device ID: 1106:3044, subsystem: 1106:3044).

> - whether SATA or USB disks are affected (SATA probably not, USB not
> used yet),

The system concerned uses SATA discs for the system drives, driven by:

RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
Subsystem: ASUSTeK Computer Inc. A7V600/K8V Deluxe/K8V-X/A8V Deluxe motherboard

I have seen no corruption on these. Once a week I am also writing to
alternating external USB2 drives (again, using rsync) and none of those have
seen this corruption either. The USB host is reported to be

USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86) (prog-if 20 [EHCI])
Subsystem: ASUSTeK Computer Inc. A7V600/K8V-X/A8V Deluxe motherboard

As I said, since a working kernel (3.3.x) seems to sit between the version
where I saw the problem (2.6.39) and the one where Stefan experienced an
issue (3.4), it's possible that these are two different issues (one fixed,
one still lurking). I throw the above out there in case it helps.

Regards
jonathan

PS: I'm not subscribed to lkml, but am to ieee1394-devel.