Re: amd64 sata_nv (massive) memory corruption

From: Linas Vepstas
Date: Sat Aug 02 2008 - 18:02:22 EST


2008/8/2 John Stoffel <john@xxxxxxxxxxx>:
>>>>>> "Linas" == Linas Vepstas <linasvepstas@xxxxxxxxx> writes:
>
> Linas> 2008/8/1 Alistair John Strachan <alistair@xxxxxxxxxxxxx>:
>>> On Friday 01 August 2008 18:30:34 Linas Vepstas wrote:
>>>> Hi,
>>>>
>>>> I'm seeing strong, easily reproducible (and silent) corruption on a
>>>> sata-attached disk drive on an amd64 board. It might be the disk
>>>> itself, but I doubt it; googling suggests that it's somehow
>>>> iommu-related but I cannot confirm this.
>
> Can you post the output of dmesg after a boot, so we can see which
> driver is being used? I assume the new Libata stuff, but maybe you
> can also turn on debugging in there as well. Stuff like SCSI_DEBUG
> (in the SCSI menus) might show us more details here.
>
> Also, have you tried a new SATA cable by any chance? That's obviously
> the cheaper path than getting a new disk...

I took the problematic hard drive (and its cable) to another computer
with SATA ports on it, and ran my file-copy/compare/fsck tests there,
and saw no problems; so the drive itself and its cable get a clean bill
of health.
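
For anyone who wants to repeat the copy/compare part of that test, here
is a minimal sketch in Python. It is not the exact script I ran: the
function names and the temporary-directory demo are made up for
illustration, and it assumes the test file is either larger than RAM or
that the page cache gets dropped between the copy and the compare, since
otherwise the read-back may come from memory rather than from the disk.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_compare(src, dst):
    """Copy src to dst, then report whether both files hash identically."""
    shutil.copyfile(src, dst)
    # Caveat: unless the page cache is dropped (or the file exceeds RAM),
    # this re-read may be served from memory and hide on-disk corruption.
    return sha256_of(src) == sha256_of(dst)


if __name__ == "__main__":
    # Hypothetical demo in a temp directory; for a real test, create the
    # working directory on the suspect disk and use a much larger file.
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "src.bin")
    dst = os.path.join(workdir, "dst.bin")
    with open(src, "wb") as f:
        f.write(os.urandom(4 * 1024 * 1024))
    print("match" if copy_and_compare(src, dst) else "MISMATCH")
    shutil.rmtree(workdir)
```

Running it in a loop with different files makes silent corruption show
up as an occasional MISMATCH instead of a filesystem error much later.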

Then, rather stupidly, I flashed the latest BIOS for the motherboard
and now have a dead motherboard (it hangs on its way through the BIOS,
well before the bootloader). So I'm off to buy a new mobo today.

I'll send the dmesg from the older boots later today, if all goes well.
I'm pretty sure I had the new libata on, and the old off, but it's
possible that the .config somehow managed to pull in parts of the
old libata code anyway. I say this because, besides the SATA, the
blown motherboard had an IDE connector in use, and I also had
another PCI IDE card plugged in and in use. I'm imagining that
perhaps the PCI IDE .config might have pulled in old code, maybe
via a header file, and thus mangled some lock that the SATA side
was using. This is just a wild guess. Most people on this mobo haven't
seen problems, and unlike most people, I had the PCI IDE card
in it.
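
A rough way to check the .config side of that guess is sketched below
in Python. It assumes that CONFIG_IDE (the old IDE subsystem) and
CONFIG_ATA (libata) are the relevant top-level symbols; the helper
names are hypothetical. Having both enabled is a normal, supported
configuration, so this only tells you whether both stacks were built,
not that one interfered with the other.

```python
def enabled_options(config_text):
    """Return {name: value} for options set to y or m in .config text."""
    enabled = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blanks and comments such as "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value in ("y", "m"):
            enabled[name] = value
    return enabled


def both_stacks_built(config_text):
    """True if the old IDE subsystem and libata are both enabled.

    Assumption: CONFIG_IDE and CONFIG_ATA are the symbols that matter.
    """
    opts = enabled_options(config_text)
    return "CONFIG_IDE" in opts and "CONFIG_ATA" in opts


if __name__ == "__main__":
    # Hypothetical .config fragment, not taken from the affected machine.
    sample = "\n".join([
        "CONFIG_ATA=y",
        "CONFIG_SATA_NV=y",
        "# CONFIG_IDE is not set",
    ])
    print(both_stacks_built(sample))
```

To run it against a real build, read the kernel tree's .config file
and pass its text to both_stacks_built().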

--linas