Re: Problem with ata layer in 2.6.24

From: Tejun Heo
Date: Sat Feb 02 2008 - 02:13:56 EST


Kasper Sandberg wrote:
> to put some timeline perspective into this.
> i believe it was in 2005 i assembled the system, and when i realized it
> was faulty, on old ide driver, i stopped using it - that miht have been
> in beginning of 2006. then for almost a year i werent using it, hoping
> to somehow fix it, but in january 2007 i think it was, atleast in the
> very beginning of 2007, i hit upon the idea of trying libata, and ever
> since the system has been running 24/7 - doing these errors around 2
> times a day.
>
> i have multiple times reported my problems to lkml, but nothing has
> happened, i also tried to aproeach jgarzik direcly, but he was not
> interested.
>
> i really hope this can be solved now, its a huge problem
>
> my fileserver has an asus k8v motherboard, with via chipset (k8t880 i
> think it is, or something like it). currently using the promise
> controller again(strangely enough all the timeouts seems to happen here,
> and when the ITE was on, there, not the onboard one), in conjunction
> with the onboard via.

Timeouts are nasty to debug. It can be caused by whole range of
different problems including transmission errors, bad power, faulty
drive, mishandled media error, IRQ misrouting, dumb hardware bug. It's
basically 'uh... I told the controller to do something but it never
called me back'.

If you see timeouts on multiple devices connected to different
controllers, the chance is that you have problem somewhere else. The
most likely culprit is bad power. Please...

* Post the result of 'lspci -nn' and kernel log including full boot log
and error messages.

* Try to isolate the problem. ie. Does removing several number of
drives fix the problem? If the problem is localized to certain device,
what happens if you move it? Does the problem follow the drive or stay
with the port? If the failing drives are SATA, it's a good idea to
power some of the failing drives with a separate PSU and see whether
anything is different.

By trying to isolate the hardware problem, more can be learned about the
error condition and even when the problem actually isn't hardware
problem, it gives us much deeper insight of the problem and clues
regarding where to look.

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/