Re: hdd errors with libata drivers

From: Robert Hancock
Date: Mon Jun 29 2009 - 20:37:38 EST


On 06/29/2009 06:45 AM, Marcin Niskiewicz wrote:
Hello!
I have 2 identical machines - both with 3 disks (WDC WD3000HLFS) -
root filesystem is under raid1, data partitions are in raid5 (using
mdadm)
gentoo, kernel version - 2.6.25-hardened-r8, ahci driver for disks...
reiserfs as filesystem...
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH)
6 port SATA AHCI Controller (rev 02)
Intel(R) Xeon(R) CPU X3360

About 4 months ago both machines died in the same way - due to a problem
with the disks - both raid5 arrays were down, the data filesystem was
unreachable... (the root filesystem survived)

I thought that it was something linked to the power supply or similar -
so I made some changes to avoid the problem ...

But a few days ago it happened again - at the SAME time - BOTH machines
had problems with disks! (again the root filesystem survived, the data
partition was corrupted and the raid5 was unreachable)

In dmesg I noticed something like this:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000001
ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
res 51/04:00:34:cf:f3/00:00:00:f3:40/a3 Emask 0x1 (device error)

Here the drive is returning command aborted (ABRT, error register 0x04) to a cache flush request (command 0xea, FLUSH CACHE EXT), suggesting it's having problems writing to the media.
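
For anyone who wants to read these lines without looking the bytes up by hand, here is a minimal Python sketch of the decoding. It assumes the usual libata layout, where the first byte after "cmd" is the command opcode and the first two bytes after "res" are the status and error registers. The function names and the small lookup tables are my own and only cover the bits and opcodes that show up in this thread.

```python
import re

# Bit names as libata prints them; only the commonly reported bits are listed.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

# A few ATA opcodes relevant here; this is not a complete table.
COMMANDS = {
    0x60: "READ FPDMA QUEUED",
    0x61: "WRITE FPDMA QUEUED",
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
}


def first_two_bytes(line, keyword):
    """Return the first two hex bytes following 'cmd' or 'res' in a dmesg line."""
    m = re.search(keyword + r"\s+([0-9a-f]{2})/([0-9a-f]{2}):", line)
    if m is None:
        raise ValueError("no %r field found in: %s" % (keyword, line))
    return int(m.group(1), 16), int(m.group(2), 16)


def bit_names(value, table):
    """List the names of the bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True) if value & bit]


def decode(cmd_line, res_line):
    """Decode the opcode from the cmd line and status/error from the res line."""
    opcode, _ = first_two_bytes(cmd_line, "cmd")
    status, error = first_two_bytes(res_line, "res")
    return {
        "command": COMMANDS.get(opcode, "unknown (0x%02x)" % opcode),
        "status": bit_names(status, STATUS_BITS),
        "error": bit_names(error, ERROR_BITS),
    }


if __name__ == "__main__":
    print(decode(
        "ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0",
        "res 51/04:00:34:cf:f3/00:00:00:f3:40/a3 Emask 0x1 (device error)",
    ))
```

Status bits that are missing from the table (such as 0x10 in the 0x51 above) are simply skipped, which matches what the kernel's own "status: { ... }" line shows.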

ata1.00: exception Emask 0x0 SAct 0x7 SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000008
ata1.00: cmd 60/08:08:f7:23:8a/00:00:0b:00:00/40 tag 1 ncq 4096 in
res 41/40:00:f7:23:8a/21:00:0b:00:00/4b Emask 0x409 (media error)<F>
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/133
ata1: EH complete

And here it's returning an uncorrectable media error (UNC) to an NCQ read (command 0x60, READ FPDMA QUEUED).
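
If you want to know which sector the failing read was aimed at, the cmd line carries it. A rough Python sketch follows, assuming the taskfile field order that libata prints as far as I understand it (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and the NCQ convention that the sector count travels in the feature fields and the tag in the upper bits of the count field. The function name is made up for illustration.

```python
import re

FIELD = r"([0-9a-f]{2})"
# cmd CC/FF:NS:L0:L1:L2/HF:HN:H0:H1:H2/DV  (assumed libata field order)
CMD_RE = re.compile(
    r"cmd\s+" + FIELD
    + "/" + ":".join([FIELD] * 5)
    + "/" + ":".join([FIELD] * 5)
    + "/" + FIELD
)


def ncq_request(cmd_line):
    """Extract LBA, sector count and tag from an NCQ read/write cmd line."""
    m = CMD_RE.search(cmd_line)
    if m is None:
        raise ValueError("no cmd taskfile found in: %s" % cmd_line)
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah,
     device) = (int(g, 16) for g in m.groups())
    if command not in (0x60, 0x61):
        raise ValueError("not an NCQ read/write (opcode 0x%02x)" % command)
    # 48-bit LBA: the "hob" (high order byte) fields hold bits 47:24.
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    # NCQ puts the sector count in the feature fields; 0 means 65536.
    sectors = ((hob_feature << 8) | feature) or 65536
    # The queue tag sits in bits 7:3 of the count field.
    tag = nsect >> 3
    return {"lba": lba, "sectors": sectors, "tag": tag}


if __name__ == "__main__":
    req = ncq_request(
        "ata1.00: cmd 60/08:08:f7:23:8a/00:00:0b:00:00/40 tag 1 ncq 4096 in"
    )
    print("LBA 0x%x, %d sectors, tag %d" % (req["lba"], req["sectors"], req["tag"]))
```

For the line above this gives 8 sectors (8 x 512 = 4096 bytes, matching the "ncq 4096 in" in the log) and tag 1, which is a useful sanity check on the assumed field order.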


On both machines dmesg errors were about ata1.00 ...

According to http://ata.wiki.kernel.org/index.php/Libata_error_messages
it looks like a hardware problem - but 6 disks in two machines - at the
same time, again?
I checked all of the disks with WD tools before going to production and
everything was OK... It's really strange ....

I found opinions that it could be a kernel bug in ata acpi - and that I
should add the noacpi or noapic option - is that true? Wouldn't it have
any effects (performance etc.) on the Intel CPU?

It seems highly unlikely that this is a kernel bug. My guess would be something common to both machines, maybe a power problem, etc.


I'm thinking about changing the kernel version - maybe to a non-hardened one ...

Any ideas?

Thanks for any help!

regards
nichu
