Re: Intel ICH9M/M-E SATA error-handling/reset problems

From: Serguei Miridonov
Date: Mon Feb 16 2009 - 11:17:55 EST


Hello,

On Sunday 15 February 2009, Tejun Heo wrote:
> Please try shorter (or different) cable.

I will, maybe in a few days.

> >> I agree with you completely. Nevertheless, something like 10
> >> errors per 2GB transfer cannot be a reason to give up. Vista,
> >> at least, recovers and continues the data transfer. Linux simply
> >> cannot return the interface or the connected device to an
> >> operating state. Do you think that is normal?
>
> Well, there isn't much point in keeping retrying if the same
> command fails consecutively.

I'm not talking about the _same_ transfer command. I mean intermittent
errors, on average 10 parity errors per 2GB file. Let me repeat myself
from another post:

... my very strong opinion, based just on general physics, is that
the error rate on SATA can be (and will be) much higher than that on
PATA. PATA operates at lower frequencies and its cables are much
shorter. eSATA cables are longer and work at up to 3Gb/s. Moreover,
consider all these consumer-grade connectors, cables, etc. So, CRC
errors could be quite common, and software needs to handle them
properly to keep transfers fast and maintain communication with the
device.
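
To put rough numbers on this, here is a back-of-envelope sketch. It
assumes "2GB" means 2e9 bytes, that one transfer command moves 128 KiB
(a hypothetical figure, not taken from the logs), and that errors hit
commands independently of each other. Under those assumptions it
estimates how often a single command is hit and how unlikely it is for
the _same_ command to fail several times in a row by pure chance:

```python
# Back-of-envelope estimate; the sizes below are assumptions, not measurements.
FILE_BYTES = 2 * 10**9        # "2GB" taken as 2e9 bytes (assumption)
ERRORS_PER_FILE = 10          # average error count quoted in this thread
COMMAND_BYTES = 128 * 1024    # hypothetical size of one transfer command


def per_command_error_probability(file_bytes, errors, command_bytes):
    """Average chance that a single command is hit by an error."""
    commands = file_bytes / command_bytes
    return errors / commands


def consecutive_failure_probability(p, attempts):
    """Chance that the same command fails `attempts` times in a row,
    assuming independent errors (a purely random, intermittent fault)."""
    return p ** attempts


p = per_command_error_probability(FILE_BYTES, ERRORS_PER_FILE, COMMAND_BYTES)
print(f"per-command error probability: {p:.2e}")
for n in (1, 2, 3):
    chance = consecutive_failure_probability(p, n)
    print(f"{n} consecutive failure(s): {chance:.2e}")
```

If the independence assumption holds, repeated failure of one command
is very rare, so a command that does fail consecutively points to
something other than random link noise.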

And remember USB bulk transfers? Who takes care of the CRC checks and
retries there?

> The problem was the broken speed down
> logic, so all the retries failed and FS eventually received IO
> failure. Should have been fixed with recent changes.

Slowing down may help to reduce the number of errors, but it may turn
out that they cannot be avoided completely.
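
Since some errors may remain even at reduced speed, what I have in mind
is roughly the following. This is only an illustrative sketch, not
libata code: `LinkError`, `transfer` and `make_flaky_send` are
hypothetical names, and the retry limit of 3 is an arbitrary choice. It
retries only the command that failed, and gives up only when that same
command keeps failing:

```python
class LinkError(Exception):
    """Stands in for a CRC/parity error on one command (hypothetical)."""


def transfer(commands, send, max_retries=3):
    """Send each command, retrying only the one that failed.

    Returns the total number of retries performed. Re-raises LinkError
    if a single command fails more than max_retries times in a row.
    """
    retries = 0
    for cmd in commands:
        failures = 0
        while True:
            try:
                send(cmd)
                break  # command went through, move on to the next one
            except LinkError:
                failures += 1
                if failures > max_retries:
                    raise  # the same command keeps failing: give up
                retries += 1
    return retries


def make_flaky_send(fail_counts):
    """Build a fake send() that fails fail_counts[cmd] times, then succeeds."""
    remaining = dict(fail_counts)

    def send(cmd):
        if remaining.get(cmd, 0) > 0:
            remaining[cmd] -= 1
            raise LinkError(cmd)

    return send


# Intermittent errors on different commands are absorbed by the retries.
print(transfer(range(10), make_flaky_send({2: 1, 7: 2})))
```

With a fake link where command 2 fails once and command 7 fails twice,
the transfer completes after three retries in total. A command that
fails four times in a row still aborts the transfer.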

> In the log, ata2.00 went down after a timeout. The reset per-se
> isn't the problem and is the RTTD after a timeout as the controller
> and device states are unknown. Situations like yours in the
> log often happen because an ATAPI device shuts down completely
> after certain transmission problems. When this happens, there's
> nothing much the driver can do and soft reboot wouldn't recover the
> device either.

So, it is the kernel's job to keep things working, not break them :-)

> But seeing you're on dv5, I think you might be experiencing
> something else. Please take a look at the following bz.
>
> http://bugzilla.kernel.org/show_bug.cgi?id=12276

Yes, I tried suspend to RAM, and when the laptop woke up it failed
to communicate with the hard drive. So, I use hibernate instead.

> ... I'm trying to
> contact HP about this but haven't gotten anywhere yet.

Please, let us know if they reply.

Thank you.


