Hello, Linus, Jeff.
On 03/10/2010 07:12 AM, Jeff Garzik wrote:
Coincidentally, it looks like someone else just reported the same
problem, with 2.6.34-rc1.
It definitely sounds like a race. READ DMA is a DMA command as the name
implies, so that eliminates the possibility of polling-related paths in
ata_sff_interrupt (libata-sff.c).
I'll flip some of my machines to the icky slow boring piix mode, rather
than sexy AHCI mode :) to see if I can reproduce. I have had a feeling
that we needed a more sophisticated IRQ handling setup, this may be what
was needed. Lost interrupt recovery should occur faster than 30 seconds
in any case, and should not require a hard reset if the hardware
functions just fine outside of the lost-interrupt / race that just
occurred.
Yeap, there is a race condition with clearing which I don't think we
can solve completely but with some modification I think we can at
least cover known failure cases.
For the longer term, I don't think we can solve this by diddling with the
SFF registers. The interface is just way too ancient and horrid to
build anything reliable on top of. I'm planning on implementing
smarter IRQ storm handling and stepped timeouts for ATA commands.
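To make the stepped-timeout idea a bit more concrete, here is a rough
sketch in plain C. It is not libata code: the function name, the table,
and the timeout values are all made up for illustration. The only point
is that early attempts give up quickly, so a lost interrupt is noticed
well before 30 seconds, and only later attempts wait the full time.

```c
#include <stddef.h>

/*
 * Hypothetical stepped timeout schedule, in milliseconds.
 * The values are illustrative assumptions, not taken from libata.
 */
static const unsigned int step_timeouts_ms[] = { 5000, 10000, 30000 };

#define NR_STEPS (sizeof(step_timeouts_ms) / sizeof(step_timeouts_ms[0]))

/*
 * Return the timeout to use for the given attempt (0-based).
 * Attempts past the end of the table reuse the last, longest step.
 */
unsigned int stepped_timeout_ms(size_t attempt)
{
	if (attempt >= NR_STEPS)
		attempt = NR_STEPS - 1;
	return step_timeouts_ms[attempt];
}
```

With this table, a first attempt would time out after 5 seconds, a
retry after 10, and anything beyond that after 30.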