Re: sata_sil24 broken since 2.6.23-rc4-mm1

From: Tejun Heo
Date: Thu Oct 11 2007 - 02:26:46 EST


Torsten Kaiser wrote:
>>> That missing +1 would explain, why the SGE_TRM never gets set.
>> Thanks a lot for tracking this down. Does changing the above code fix
>> your problem?
>
> I did not try it.
> I'm not an libata expert and while this change looks suspicios, I
> can't be 100% sure if that change was intended.
> And I did not want to experiment this deep in the code and risk
> corrupting the hole drive.

I don't think you would risk too much by changing that bit of code.
Please try it.

>>> But I'm still not understanding, how the kernel could only fail
>>> sometimes at bootup, but after that working without any visible
>>> errors? Is the sil-chip rather intelligent about detecting corrupted
>>> sglists and silently ignoring them?
>> I have no idea why it fails only sometimes.
>
> And that is, why I'm so unsure.
> The error looks to serious to only cause random failures on one of two
> drives on bootup.
> I never had trouble with the remaining drive on the SiI-chip or both
> drives if one got killed during booting.
>
> I'm guessing that leaving the computer powered down long enough fills
> the RAM with a special pattern that really hangs the drive, while
> normaly it would just reject the invalid data. (I have ECC-RAM, does
> this matter?)
>
> Another guess might be that most of the time the Sil-chip correctly
> terminates after the transfer-length is reached, even if SGE_TRM is
> missing...

I have no idea either. We'll probably need a PCI bus tracer to tell
exactly what's going on.

Thanks.

--
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/