Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller?

From: Nix
Date: Wed Sep 19 2012 - 15:24:59 EST


So I have this x86-64 server running Linux 3.5.1 with a SATA-on-PCIe
Areca 1210 hardware RAID-5 controller driven by arcmsr, which has been
humming along happily for years -- but suddenly, today, the entire
machine froze for a couple of minutes (or at least fs access froze),
followed by this in the logs:

Sep 19 16:55:47 spindle notice: [3447524.381843] arcmsr0: abort device command of scsi id = 0 lun = 1
[... repeated a few times at intervals over the next five minutes,
followed by a mass of them at 16:59:29, and...]
Sep 19 16:59:25 spindle err: [3447657.821450] arcmsr: executing bus reset eh.....num_resets = 0, num_aborts = 33
Sep 19 16:59:25 spindle notice: [3447697.878386] arcmsr0: wait 'abort all outstanding command' timeout
Sep 19 16:59:25 spindle notice: [3447697.878628] arcmsr0: executing hw bus reset .....
Sep 19 16:59:25 spindle err: [3447698.287054] irq 16: nobody cared (try booting with the "irqpoll" option)
Sep 19 16:59:25 spindle warning: [3447698.287291] Pid: 0, comm: swapper/4 Not tainted 3.5.1-dirty #1
Sep 19 16:59:25 spindle warning: [3447698.287522] Call Trace:
Sep 19 16:59:25 spindle warning: [3447698.287754] <IRQ> [<ffffffff810af5ba>] __report_bad_irq+0x31/0xc2
Sep 19 16:59:25 spindle warning: [3447698.288031] [<ffffffff810af84e>] note_interrupt+0x16a/0x1e8
Sep 19 16:59:25 spindle warning: [3447698.288263] [<ffffffff810ad9d5>] handle_irq_event_percpu+0x163/0x1a5
Sep 19 16:59:25 spindle warning: [3447698.288497] [<ffffffff810ada4f>] handle_irq_event+0x38/0x55
Sep 19 16:59:25 spindle warning: [3447698.288727] [<ffffffff810b01a0>] handle_fasteoi_irq+0x78/0xab
Sep 19 16:59:25 spindle warning: [3447698.288960] [<ffffffff8103631c>] handle_irq+0x24/0x2a
Sep 19 16:59:25 spindle warning: [3447698.289189] [<ffffffff81036229>] do_IRQ+0x4d/0xb4
Sep 19 16:59:25 spindle warning: [3447698.289419] [<ffffffff815070e7>] common_interrupt+0x67/0x67
Sep 19 16:59:25 spindle warning: [3447698.289648] <EOI> [<ffffffff812ab174>] ? acpi_idle_enter_c1+0xcb/0xf2
Sep 19 16:59:25 spindle warning: [3447698.289919] [<ffffffff812ab152>] ? acpi_idle_enter_c1+0xa9/0xf2
Sep 19 16:59:25 spindle warning: [3447698.290152] [<ffffffff813c1446>] cpuidle_enter+0x12/0x14
Sep 19 16:59:25 spindle warning: [3447698.290382] [<ffffffff813c1902>] cpuidle_idle_call+0xc5/0x175
Sep 19 16:59:25 spindle warning: [3447698.290614] [<ffffffff8103c2da>] cpu_idle+0x5b/0xa5
Sep 19 16:59:25 spindle warning: [3447698.290844] [<ffffffff81ad4fcb>] start_secondary+0x1a2/0x1a6
Sep 19 16:59:25 spindle err: [3447698.291074] handlers:
Sep 19 16:59:25 spindle err: [3447698.291294] [<ffffffff8133b9a3>] usb_hcd_irq
Sep 19 16:59:25 spindle emerg: [3447698.291553] Disabling IRQ #16
Sep 19 16:59:25 spindle err: [3447710.888187] arcmsr0: waiting for hw bus reset return, retry=0
Sep 19 16:59:25 spindle err: [3447720.882155] arcmsr0: waiting for hw bus reset return, retry=1
Sep 19 16:59:25 spindle notice: [3447730.896410] Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
Sep 19 16:59:25 spindle err: [3447730.916348] arcmsr: scsi bus reset eh returns with success
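
A rough way to tally these arcmsr messages from a saved log, rather than
counting them by eye, might look like the sketch below. The regular
expressions are copied from the exact wording of the lines quoted above,
so it is an assumption that other arcmsr versions word them the same way;
summarize_arcmsr is a hypothetical helper name, not part of any tool.

```python
import re
from collections import Counter

# Patterns taken from the log lines quoted above. Assumption: other arcmsr
# driver versions may phrase these messages differently.
ABORT_RE = re.compile(
    r"arcmsr\d+: abort device command of scsi id = (\d+) lun = (\d+)")
RESET_RE = re.compile(
    r"arcmsr: executing bus reset eh\.+num_resets = (\d+), num_aborts = (\d+)")
RESET_OK_RE = re.compile(r"arcmsr: scsi bus reset eh returns with success")


def summarize_arcmsr(lines):
    """Count abort messages per (scsi id, lun) and collect bus reset events.

    `lines` is any iterable of log lines, e.g. an open file object.
    """
    aborts = Counter()
    resets = []      # (num_resets, num_aborts) as reported by the driver
    resets_ok = 0    # how many resets were reported as successful
    for line in lines:
        m = ABORT_RE.search(line)
        if m:
            aborts[(int(m.group(1)), int(m.group(2)))] += 1
            continue
        m = RESET_RE.search(line)
        if m:
            resets.append((int(m.group(1)), int(m.group(2))))
            continue
        if RESET_OK_RE.search(line):
            resets_ok += 1
    return {"aborts": dict(aborts), "resets": resets, "resets_ok": resets_ok}
```

Passing it an open log file (wherever the local syslog setup puts kernel
messages) gives one dictionary per run, which makes it easier to spot
whether aborts are creeping up between resets.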

This is the first SCSI (that is, um, ATA) bus reset I have *ever* had on
this machine, hence my concern. (The IRQ disable we can ignore: it was
just bad luck that an interrupt destined for the Areca hit after the
controller had briefly vanished from the PCI bus as part of resetting.)

Now, just last week another (surge-protected) machine on the same power
main died without warning: its power supply fried, apparently roasting
the BIOS and/or other motherboard components on the way out (the ACPI
DSDT was filled with rubbish, and other things must have been fried
because even with ACPI off Linux wouldn't boot more than one time out of
a hundred, freezing solid at a different place in the boot each time).
So my worry level when this SCSI bus reset turned up
today is quite high. It's higher given that the controller logs
(accessed via the Areca binary-only utility for this purpose) show no
sign of any problem at all.

EDAC shows no PCI bus problems and no memory problems, so this probably
*is* the controller.
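
For reference, the EDAC counters can be read straight from sysfs. The
sketch below assumes the usual layout under /sys/devices/system/edac
(mc/mcN/ce_count, mc/mcN/ue_count, pci/pci_parity_count,
pci/pci_nonparity_count); which files actually exist depends on the EDAC
driver loaded, and read_edac_counters is a hypothetical helper name.
All-zero counters only say that EDAC saw nothing, not that the hardware
is healthy.

```python
from pathlib import Path


def read_edac_counters(root="/sys/devices/system/edac"):
    """Return whichever EDAC error counters exist under `root`.

    Assumption: the standard EDAC sysfs layout. Files that are missing
    (no EDAC driver loaded, PCI checking unavailable) are skipped.
    """
    root = Path(root)
    counters = {}
    # Per-memory-controller correctable / uncorrectable error counts.
    for mc in sorted(root.glob("mc/mc*")):
        for name in ("ce_count", "ue_count"):
            f = mc / name
            if f.is_file():
                counters[f"{mc.name}/{name}"] = int(f.read_text().strip())
    # PCI parity / non-parity error counts.
    for name in ("pci_parity_count", "pci_nonparity_count"):
        f = root / "pci" / name
        if f.is_file():
            counters[f"pci/{name}"] = int(f.read_text().strip())
    return counters
```

Any nonzero value in the result would point away from the controller and
towards memory or the PCI bus.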

So... is this a serious problem? Does anyone know if I'm about to lose
this controller, or indeed machine as well? (I really, really hope not.)

I'd write this off as a spurious problem and not report it at all, but
I'm jittery as heck after the catastrophic hardware failure last week,
and when this happens in close proximity, I worry.