Huge problem with XFS/iCH7R

From: Carsten Otto
Date: Sun Jul 02 2006 - 15:54:55 EST


Hi there!

(System specs below)

Short summary:
System (with software RAID 5, XFS, four disks connected to an AHCI
controller) crashes very often and loses data.

My system crashes every few days, at the moment daily. The message shown
is below (the affected drive changes almost every time; I do not see a
pattern here):
---
ata4: handling error/timeout
ata4: port reset, p_is 0 is 0 pis 0 cmd c017 tf 7f ss 0 se 0
ata4: status=0x50 { DriveReady SeekComplete }
sdd: Current: sense key=0x0
ASC=0x0 ASCQ=0x0
Info fid=0x0
---
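
For reference, here is a minimal sketch of how the status byte in the
message above breaks down into flags. It is not taken from the kernel
source; the bit masks are the standard ATA status register bits, the flag
names are chosen to match the log line, and decode_ata_status is a
hypothetical helper. Note that neither Busy nor Error is set in 0x50, so the
drive reports itself as ready even though the command timed out.

```python
# Bit masks of the ATA status register (standard ATA definitions).
# The names mirror the style of the kernel log line above.
ATA_STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_ata_status(status):
    """Return the names of all flags set in an ATA status byte."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


print(decode_ata_status(0x50))  # ['DriveReady', 'SeekComplete']
```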
Although, according to this message, only one of four drives failed (in
software RAID 5), the system does not do anything useful. Hitting enter at
the login prompt does cause the password prompt to appear, but no service
responds.
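
My expectation that a single failed drive should be survivable rests on how
RAID 5 parity works. The following is only a toy sketch of that idea, not
code from the md driver: the block contents are made up and xor_blocks is a
hypothetical helper. With four disks, each stripe holds three data blocks
and one parity block, and any one missing block is the XOR of the other
three.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across four disks: three data blocks plus one parity block.
# The contents are made-up placeholder data.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
stripe = data + [parity]

# Simulate losing any single disk and rebuild its block from the other three.
for lost in range(len(stripe)):
    survivors = [blk for i, blk in enumerate(stripe) if i != lost]
    assert xor_blocks(survivors) == stripe[lost]
```

So in principle the array should keep running in degraded mode when one
drive drops out, which is not what I observe.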

If I do a soft reset (using Magic SysRq: u, then b), the BIOS fails to detect
exactly one drive (the one shown in the error message, I guess).
After a hard reset all drives are found, but I have to do a RAID resync and
run xfs_repair (at least; sometimes the RAID needs to be tricked into starting).

This problem occurred with all kernels (all vanilla), starting with
2.6.16.something up to 2.6.17.2.
I checked all four drives with a Maxtor tool; all drives are fine.
The temperature is not a problem; all drives are stable at about 35°C.
I replaced the SATA cables several times.

Some images showing the errors on screen are here:
http://c-otto.de/fehler/

I'd like to know what component causes this problem and how I can solve
it.

Please tell me if you need further information!

System specs:
- Intel iCH7R on Foxconn 945P7AA-EKRS2
- Pentium D 805 (2.66 GHz, 1MB Cache, Dual Core)
- 4x Maxtor 7V300F0 (MaXLine Plus III 300 GB; SATA 2; 16 MB Cache)
- 2 GB RAM

PS: Please include me in CC as I do not read the whole LKML.

Thanks a lot,
--
Carsten Otto
c-otto@xxxxxx
www.c-otto.de
