nvidia controller failed command, possibly related to SMART self-test (2.6.32)

From: martin f krafft
Date: Sat Mar 13 2010 - 04:34:01 EST


Hello,

I swapped a new motherboard into a server that was previously
having occasional SATA hiccoughs[0]. It didn't last 24 hours
before I got the next set of troubles:

0. http://marc.info/?l=linux-kernel&m=125654588201284&w=2

kernel: [45091.756037] ata4: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
kernel: [45091.756042] ata4: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
kernel: [45091.756043] dhfis 0x1 dmafis 0x0 sdbfis 0x0
kernel: [45091.756046] ata4: ATA_REG 0x40 ERR_REG 0x0
kernel: [45091.756048] ata4: tag : dhfis dmafis sdbfis sacitve
kernel: [45091.756051] ata4: tag 0x0: 1 0 0 1
kernel: [45091.756063] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
kernel: [45091.756068] ata4.00: failed command: WRITE FPDMA QUEUED
kernel: [45091.756074] ata4.00: cmd 61/08:00:07:30:e1/00:00:01:00:00/40 tag 0 ncq 4096 out
kernel: [45091.756075] res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
kernel: [45091.756077] ata4.00: status: { DRDY }
kernel: [45091.756085] ata4: hard resetting link
kernel: [45091.756087] ata4: nv: skipping hardreset on occupied port
kernel: [45097.264713] ata4: link is slow to respond, please be patient (ready=0)
kernel: [45101.800044] ata4: SRST failed (errno=-16)
[…]
kernel: [45151.900793] ata4: reset failed, giving up
kernel: [45151.900797] ata4.00: disabled
kernel: [45151.900851] sd 3:0:0:0: [sdd] Unhandled error code
kernel: [45151.900853] sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
kernel: [45151.900856] sd 3:0:0:0: [sdd] CDB: Write(10): 2a 00 01 e1 30 07 00 00 08 00
kernel: [45151.900864] end_request: I/O error, dev sdd, sector 31535111
kernel: [45151.900870] raid1: Disk failure on sdd2, disabling device.
kernel: [45151.900871] raid1: Operation continuing on 1 devices.

How do I learn how to interpret such kernel logs?
Does it suggest anything about who's at fault?
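
One thing that can at least be checked mechanically is whether the
failed ATA command and the SCSI write in the log above refer to the
same sector. The following is only a minimal sketch: it assumes the
usual libata layout for the "cmd" line
(command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device)
and that NCQ commands carry the sector count in the feature fields.
The function names are made up for illustration.

```python
# Assumed libata "cmd" field order; the names below are illustrative only.
TASKFILE_FIELDS = [
    "command", "feature", "nsect", "lbal", "lbam", "lbah",
    "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
    "device",
]


def parse_taskfile(dump):
    """Split a 'cmd' dump such as '61/08:00:...' into named registers."""
    values = [int(x, 16) for x in dump.replace("/", ":").split(":")]
    if len(values) != len(TASKFILE_FIELDS):
        raise ValueError("expected %d fields" % len(TASKFILE_FIELDS))
    return dict(zip(TASKFILE_FIELDS, values))


def lba48(tf):
    """Assemble the 48-bit LBA from the low and high-order-byte registers."""
    return (tf["hob_lbah"] << 40 | tf["hob_lbam"] << 32 | tf["hob_lbal"] << 24
            | tf["lbah"] << 16 | tf["lbam"] << 8 | tf["lbal"])


def ncq_sector_count(tf):
    """For NCQ commands the sector count lives in the feature fields."""
    return tf["hob_feature"] << 8 | tf["feature"]


def write10_lba_and_count(cdb_hex):
    """Decode LBA and block count from a SCSI WRITE(10) CDB."""
    cdb = bytes.fromhex(cdb_hex)
    if cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    return int.from_bytes(cdb[2:6], "big"), int.from_bytes(cdb[7:9], "big")


# Values copied from the ata4.00 and sd 3:0:0:0 lines in the log.
tf = parse_taskfile("61/08:00:07:30:e1/00:00:01:00:00/40")
print(lba48(tf), ncq_sector_count(tf))
print(write10_lba_and_count("2a 00 01 e1 30 07 00 00 08 00"))
```

Both decodings come out at sector 31535111 with a count of 8, which
matches the end_request line and, at 512 bytes per sector, the
"ncq 4096 out" in the cmd line. So the write that timed out is the
one that got the disk kicked from the array, but that still doesn't
tell me who is at fault.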

If it's of any relevance, the problems also occurred with 2.6.26, but
the RAID code didn't always eject the disks on that kernel; the
first time I encountered a degraded array due to this was shortly
after the upgrade to 2.6.32. However, this is speculation; I have
not verified the causality.


All this happened at 2:09am, which made me wonder about smartd, and
indeed this is the time I scheduled SMART self-tests on the device.

What's more: I can reproduce the problem at will, e.g. by running a
short SMART self-test and a RAID resync on the device at the same
time, and boom!
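
Roughly, the stress test can be sketched as below. This is an
approximation under assumptions rather than an exact record of what
I ran: the array name md0 is a placeholder, and writing "check" to
the md sync_action file stands in for a resync, since it generates
comparable read load across the mirror. Both helpers default to a
dry run and only return what they would do.

```python
import subprocess


def start_short_selftest(disk, dry_run=True):
    """Start a short SMART self-test on `disk` via smartctl."""
    argv = ["smartctl", "-t", "short", disk]
    if not dry_run:
        subprocess.run(argv, check=True)  # needs root
    return argv


def start_md_check(md_array, dry_run=True):
    """Ask md to scan the array by writing 'check' to sync_action.

    Assumption: a check pass is used here in place of a full resync.
    """
    path = "/sys/block/%s/md/sync_action" % md_array
    if not dry_run:
        with open(path, "w") as f:  # needs root
            f.write("check\n")
    return path


# Dry run: print what would be executed. 'md0' is a hypothetical name.
print(" ".join(start_short_selftest("/dev/sdd")))
print("echo check >", start_md_check("md0"))
```

With dry_run=False and root privileges, the two calls start the
self-test and the array scan back to back.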

However, I can only reproduce this on two disks, which are on
separate SATA controller channels, ata2 and ata4, which makes me
think that the problems are with the disks, not with the controller
(ata1 and ata3 stand up fine to the stress test).

Generally, SMART self-tests should be a transparent operation that
doesn't affect the operating system's use of the devices, right? Is
it conceivable or even common that the disks' own controllers are
broken to the point where they fall over during SMART self-tests?

Thank you for any feedback,

--
martin | http://madduck.net/ | http://two.sentenc.es/

due to lack of interest tomorrow has been cancelled.

spamtraps: madduck.bogus@xxxxxxxxxxx

Attachment: digital_signature_gpg.asc
Description: Digital signature (see http://martin-krafft.net/gpg/)