failed command FLUSH CACHE EXT (was: Re: via 8237 sata errors)

From: Thomas Fjellstrom
Date: Wed Jun 02 2010 - 17:11:03 EST


On May 30, 2010, Thomas Fjellstrom wrote:
> On May 30, 2010, Thomas Fjellstrom wrote:
> > On May 30, 2010, Robert Hancock wrote:
> > > On Sun, May 30, 2010 at 1:33 PM, Thomas Fjellstrom
> > >
> > > <tfjellstrom@xxxxxxxxxxxxxxx> wrote:
> > > > On May 30, 2010, Thomas Fjellstrom wrote:
> > > >> On May 30, 2010, Robert Hancock wrote:
> > > >> > On 05/29/2010 08:46 PM, Thomas Fjellstrom wrote:
> > > >> > > I'm getting a rather nasty set of messages from dmesg when
> > > >> > > trying to use a couple SATA II WD 2TB Green drives with an
> > > >> > > older system (via 8237 based). They seem to work fine on a
> > > >> > > newer p35 based system.
> > > >> >
> > > >> > ..
> > > >> >
> > > >> > > [ 283.308963] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> > > >> > > [ 283.309007] ata2.00: BMDMA stat 0x4
> > > >> > > [ 283.309045] ata2.00: failed command: READ DMA
> > > >> > > [ 283.309089] ata2.00: cmd c8/00:08:08:08:30/00:00:00:00:00/e0 tag 0 dma 4096 in
> > > >> > > [ 283.309091]          res 41/04:00:08:08:30/00:00:00:00:00/e0 Emask 0x1 (device error)
> > > >> > > [ 283.309171] ata2.00: status: { DRDY ERR }
> > > >> > > [ 283.309207] ata2.00: error: { ABRT }
> > > >> > > [ 283.324886] ata2.00: configured for UDMA/133
> > > >> > > [ 283.324904] ata2: EH complete
> > > >> >
> > > >> > It's not really clear why the drive is returning command aborted
> > > >> > on a read, there's no other error bits to indicate an
> > > >> > uncorrectable error or a CRC error. Is it only the one drive
> > > >> > that's giving the errors?
> > > >>
> > > >> I'm not entirely sure if it's the same drive each time; I can
> > > >> check today. The fun part is that it works fine in a different
> > > >> machine, whereas it will start erroring out like that almost right
> > > >> away in the via based machine. When it's doing that, it's also
> > > >> making some fairly scary (for a hard drive) noises, but since it
> > > >> doesn't do that in the p35 machine I'm really hoping it isn't the
> > > >> drive.
> > > >
> > > > I've started up a dd on each drive, just for kicks, and reading
> > > > from both of them at the same time seems to work fine on the via
> > > > chipset. Given this little tidbit, it seems it's only once md-raid
> > > > is set up on the drives that one of them freaks out.
> > >
> > > If the problem happens mostly when both drives are in use then it
> > > could be a power supply issue. Some drives are rather sensitive to
> > > power fluctuations. You could try and move the drives to separate
> > > power cables if they're on the same one, or maybe the power supply's
> > > just not up to snuff.
> >
> > I'll give that a shot too. But I have dd running on both drives right
> > now (dd if=/dev/sdX of=/dev/null), and it's running fine. I'm also
> > going to run some other tests to see whether there's a RAM problem;
> > that could be the cause.
>
> Interesting: after moving each WD drive to its own power cable, that
> particular problem seems to have gone away. However, another problem has
> popped up which I'm not sure is related, so I'll post a separate thread
> (unable to handle kernel paging request).
>
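For reference, the parallel read test discussed above can be sketched as a small script. This is a hypothetical helper, not something from the thread; the function name, log paths, and device names are all illustrative:

```shell
#!/bin/sh
# Hypothetical sketch of the test above: read several drives front to back
# in parallel, to load the controller and power supply roughly the way
# md-raid would. Usage: parallel_read /dev/sdb /dev/sdc (names are examples).
parallel_read() {
    for dev in "$@"; do
        # bs=1M keeps the requests large; dd's stderr (transfer stats and
        # any I/O errors) lands in a per-device log under /tmp
        dd if="$dev" of=/dev/null bs=1M 2>"/tmp/ddtest.$(basename "$dev").log" &
    done
    wait    # block until every background dd has finished
}
```

Any read errors or link resets would then show up both in the per-device logs and in dmesg.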

Ok, more testing: I've moved the drives over to the p35 machine semi-permanently, and after a day or so of uptime I got some new errors:

ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata3.00: failed command: FLUSH CACHE EXT
ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata3.00: status: { DRDY }
ata3: hard resetting link
ata3: link is slow to respond, please be patient (ready=0)
ata3: SRST failed (errno=-16)
ata3: hard resetting link
ata3: link is slow to respond, please be patient (ready=0)
ata3: SRST failed (errno=-16)
ata3: hard resetting link
ata3: link is slow to respond, please be patient (ready=0)
ata3: SRST failed (errno=-16)
ata3: limiting SATA link speed to 1.5 Gbps
ata3: hard resetting link
ata3: SRST failed (errno=-16)
ata3: reset failed, giving up
ata3.00: disabled
ata3.00: device reported invalid CHS sector 0
ata3: EH complete
end_request: I/O error, dev sdc, sector 0
sd 2:0:0:0: [sdc] Unhandled error code
sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 00 00 07 a7 00 00 08 00
end_request: I/O error, dev sdc, sector 1959
Buffer I/O error on device dm-0, logical block 189
lost page write due to I/O error on dm-0
end_request: I/O error, dev sdc, sector 0
end_request: I/O error, dev sdc, sector 0
JBD2 unexpected failure: do_get_write_access: buffer_uptodate(jh2bh(jh));
Possible IO failure.

end_request: I/O error, dev sdc, sector 0
end_request: I/O error, dev sdc, sector 0
------------[ cut here ]------------
WARNING: at /home/damentz/src/zen/main/linux-liquorix-2.6-2.6.34/debian/build/source_amd64_none/fs/buffer.c:1199 mark_buffer_dirty+0x74/0x90()
Hardware name: P5K SE
Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6
acpi_cpufreq cpufreq_ondemand freq_table cpufreq_conservative
cpufreq_userspace cpufreq_powersave af_packet ext3 jbd loop
snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss
snd_mixer_oss snd_pcm rtc_cmos rtc_core snd_timer tpm_tis nvidia(P) tpm
rtc_lib tpm_bios evdev snd intel_agp pcspkr asus_atk0110 soundcore i2c_i801
snd_page_alloc button i2c_core processor dm_mod raid10 raid456
async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx
raid1 raid0 multipath linear md_mod ext4 mbcache jbd2 crc16 usbhid sd_mod
ata_generic pata_acpi uhci_hcd ata_piix libata floppy scsi_mod thermal atl1
mii ehci_hcd [last unloaded: scsi_wait_scan]
Pid: 3283, comm: jbd2/dm-0-8 Tainted: P 2.6.34-0.dmz.8-liquorix-amd64 #1
Call Trace:
[<ffffffff8103bf23>] ? warn_slowpath_common+0x73/0xb0
[<ffffffff81101cd4>] ? mark_buffer_dirty+0x74/0x90
[<ffffffffa005a3c9>] ? __jbd2_journal_unfile_buffer+0x9/0x20 [jbd2]
[<ffffffffa005d8a3>] ? jbd2_journal_commit_transaction+0xba3/0x12d0 [jbd2]
[<ffffffff810542d0>] ? autoremove_wake_function+0x0/0x30
[<ffffffffa0061701>] ? kjournald2+0xb1/0x210 [jbd2]
[<ffffffff810542d0>] ? autoremove_wake_function+0x0/0x30
[<ffffffffa0061650>] ? kjournald2+0x0/0x210 [jbd2]
[<ffffffff81053e3e>] ? kthread+0x8e/0xa0
[<ffffffff81033e8d>] ? schedule_tail+0x4d/0xf0
[<ffffffff81003c94>] ? kernel_thread_helper+0x4/0x10
[<ffffffff81053db0>] ? kthread+0x0/0xa0
[<ffffffff81003c90>] ? kernel_thread_helper+0x0/0x10
---[ end trace c90e4c710c9ef513 ]---
end_request: I/O error, dev sdc, sector 0

(and plenty more dmesg lines from LVM and ext4/jbd2 screaming about the I/O
commands failing)

I take it this means the drive is likely pooched? I'm going to run some more
tests, and make sure both of the WD drives are on their own power cables
first, but I'm betting now that the drive is just failing. That would make 2
out of 4 from the same batch that have had issues. The first one would
increase its reallocated sector count by 4 every hour or so; now this one
fails a flush cache command (among other spurious errors).
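One way to track that reallocated sector count over time is to pull SMART attribute 5 (Reallocated_Sector_Ct) out of smartctl's attribute table. The helper below is a hypothetical sketch (assuming smartmontools is installed; the device node is an example), with the parsing kept separate since smartctl itself needs root and a real drive:

```shell
#!/bin/sh
# Hypothetical helper: filter `smartctl -A` output down to the raw value
# of SMART attribute 5, Reallocated_Sector_Ct. In smartctl's table the
# attribute ID is the first column and the raw value is the last.
realloc_count() {
    awk '$1 == 5 { print $NF }'
}
# Example use (device name is illustrative):
#   smartctl -A /dev/sdc | realloc_count
```

Logging that value once an hour would show directly whether this drive is reallocating sectors at the same rate the first bad one did.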

I guess it's time to break out the WD diagnostics disk.

--
Thomas Fjellstrom
tfjellstrom@xxxxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/