Re: Intel ICH9M/M-E SATA error-handling/reset problems

From: Tejun Heo
Date: Thu Feb 19 2009 - 01:29:58 EST


Hello, Serguei.

Serguei Miridonov wrote:
>>>> I agree with you completely. Nevertheless, something like 10
>>>> errors per 2GB transfer cannot be a reason to give up. Vista,
>>>> at least, recovers and continues the data transfer. Linux simply
>>>> cannot return the interface or the connected device to operating
>>>> mode. Do you think that is normal?
>> Well, there isn't much point in continuing to retry if the same
>> command fails consecutively.
>
> I'm not talking about the _same_ transfer command. I mean intermittent
> errors, average 10 parity errors per 2GB file. Let me repeat myself
> from another post:
>
> ... my very strong opinion, based just on general physics, is that
> the error rate on SATA can be (and will be) much higher than on PATA.
> PATA operates at lower frequencies and its cables are much shorter.
> eSATA cables are longer and work at up to 3Gb/s. Moreover, consider
> all these consumer-grade connectors, cables, etc. So, CRC errors could
> be quite common and software needs to handle them properly to keep
> transfers fast and maintain communication with the device.

The kernel doesn't give up after intermittent errors.
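
As a rough illustration of the difference between intermittent errors
and the same command failing consecutively, here is a minimal sketch.
It is not actual libata code: the names cmd_state and should_retry and
the limit of 5 tries are made up for this example. The idea is only
that the failure count belongs to one command, so errors spread across
many different commands never add up to a give-up condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-command retry budget; not a libata constant. */
#define MAX_TRIES_PER_CMD 5

/* Hypothetical bookkeeping for a single command. */
struct cmd_state {
	int failed_tries;	/* consecutive failures of this one command */
};

/*
 * Record the result of one attempt at a command.
 * Returns true if the command should be issued again, false if it
 * either completed or has used up its retry budget.
 */
static bool should_retry(struct cmd_state *cmd, bool attempt_ok)
{
	assert(cmd != NULL);

	if (attempt_ok) {
		cmd->failed_tries = 0;	/* success clears the count */
		return false;		/* done, nothing to retry */
	}

	cmd->failed_tries++;
	return cmd->failed_tries < MAX_TRIES_PER_CMD;
}
```

With this model, ten commands that each fail once and then succeed are
all retried and completed. Only a single command that fails on every
one of its tries ends up as an IO failure for the upper layer.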

> And, remember USB bulk transfers? Who is taking care of the CRC check
> and retries there?

What you're describing is already handled. No need to worry about it.

>> The problem was the broken speed-down
>> logic, so all the retries failed and the FS eventually received an
>> IO failure. It should have been fixed with recent changes.
>
> Slowing down may help to reduce the number of errors, but it may
> happen that they cannot be avoided completely.
>
>> In the log, ata2.00 went down after a timeout. The reset per se
>> isn't the problem; it is the right thing to do after a timeout, as
>> the controller and device states are unknown. Situations like the
>> one in your log often happen because an ATAPI device shuts down
>> completely after certain transmission problems. When this happens,
>> there's not much the driver can do, and a soft reboot wouldn't
>> recover the device either.
>
> So, it is the kernel's job to keep things working, not break them :-)

Yeah, and other than the hardware quirkiness on your machine, it
already works fine.

>> But seeing you're on dv5, I think you might be experiencing
>> something else. Please take a look at the following bz.
>>
>> http://bugzilla.kernel.org/show_bug.cgi?id=12276
>
> Yes, I tried to suspend to RAM and when the laptop woke up it failed
> to communicate with the hard drive. So, I use hibernate instead.

Can you please take a look at the kernel log after the kernel
resumes and see whether you're actually seeing the same problem?

Thanks.

--
tejun