Re: no DRQ after issuing WRITE was Re: 2.4.23-uv3 patch set released

From: Daniel Tram Lux
Date: Sat Jan 03 2004 - 06:24:47 EST


Rob Love wrote:

On Tue, 2003-12-30 at 17:54, Linus Torvalds wrote:



Interrupts are _not_ disabled here, very much on purpose. If they were, then "jiffies" wouldn't update, and the timeouts wouldn't work.

This is what that _stupid_ "local_irq_set()" function does: it saves the old irq masking state, and then it enables it.

The whole concept doesn't make any sense. If you enable interrupts, there is little point in saving the callers irq mask, since it already got deflated.



Ah, OK. local_irq_set() is worthless, then.

Curious to see the results of upping the timeout.

Rob Love



I tried raising the timeout as a first fix. It decreased the frequency of the error,
but it did not get rid of it.
I used:

#define WAIT_DRQ (10*HZ/100) /* 100msec - spec allows up to 20ms */

instead of:

#define WAIT_DRQ (5*HZ/100) /* 50msec - spec allows up to 20ms */
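
As a rough illustration of what these two values mean in clock ticks, here is a small sketch. It assumes HZ is 100 (one jiffy per 10 ms), which is not stated in this mail; the names WAIT_DRQ_OLD, WAIT_DRQ_NEW and jiffies_to_ms are made up for the example.

```c
/* Assumption: HZ = 100, i.e. one jiffy is 10 ms. Not stated in the mail. */
#define HZ 100

/* The two timeout values discussed above, under hypothetical names. */
#define WAIT_DRQ_OLD (5*HZ/100)   /* stock value: 50 msec  */
#define WAIT_DRQ_NEW (10*HZ/100)  /* raised value: 100 msec */

/* Convert a jiffies count to milliseconds for the assumed HZ. */
static int jiffies_to_ms(int j)
{
    return j * 1000 / HZ;
}
```

Under that assumption the stock timeout is only 5 jiffies and the raised one 10, so a delay of a few ticks between two status reads already uses up a large part of the budget.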


The device the error occurs with is a CF card. The error also occurs much more frequently in
2.4.23 than in 2.4.20 (but it can be provoked in 2.4.20). Neither kernel uses the preemption patch
and both are from kernel.org. The platform is based on an AMD Elan processor, which is
a 486-compatible processor running at 133 MHz. The IDE subsystem does not use any extra
drivers and is not a PCI IDE chipset.

The test I use to provoke the error is moving a directory tree from hdc (a normal hard disk)
to hda (the CF card), removing the directory on hdc, copying it back from hda to hdc, removing it
from hda, and then starting all over.
While doing this there is a flood ping running, the machine is itself being flood pinged, and there
is traffic on three serial ports (RS485).

The way the code works right now, there is no way to tell how much time has passed
since the status register was last read, because an interrupt may have been serviced in between.
So when I made the patch I saw two possibilities: either disable interrupts while reading the
status and then checking the timeout, and enable them again afterwards,
or just make one extra check after the timeout has expired, because that is cheaper
than returning, failing and then resetting the drive. After I applied my patch (using the
5*HZ/100 timeout) my test ran for a full weekend without giving the timeout error.
Before, the error would occur about every 3 minutes with 2.4.23 and every couple of
hours on 2.4.20. (I didn't try to patch 2.4.20.)
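
A minimal user-space sketch of the second possibility (one extra status check after the timeout has expired) is below. It is not the actual patch and not the kernel's code: wait_drq, struct sim, sim_status and sim_jiffies are hypothetical stand-ins for the polling loop, the drive and the jiffies counter. Only the status bit values are the standard ATA ones.

```c
#include <stdbool.h>

/* ATA status register bits. */
#define BUSY_STAT 0x80
#define DRQ_STAT  0x08

/* Hypothetical hooks standing in for the status register and jiffies. */
typedef unsigned char (*read_status_fn)(void *ctx);
typedef unsigned long (*read_jiffies_fn)(void *ctx);

static bool drq_ready(unsigned char stat)
{
    return !(stat & BUSY_STAT) && (stat & DRQ_STAT);
}

/*
 * Poll for DRQ until `timeout` ticks have passed. If `recheck` is set,
 * read the status one more time after the deadline, since the last
 * read inside the loop may be much older than the clock suggests.
 */
static bool wait_drq(read_status_fn read_status, read_jiffies_fn now,
                     void *ctx, unsigned long timeout, bool recheck)
{
    unsigned long deadline = now(ctx) + timeout;

    do {
        if (drq_ready(read_status(ctx)))
            return true;
        /* An interrupt taken here can eat the rest of the timeout. */
    } while ((long)(deadline - now(ctx)) > 0);

    return recheck ? drq_ready(read_status(ctx)) : false;
}

/* Simulated drive and clock, for illustration only. */
struct sim {
    unsigned long ticks;    /* fake jiffies */
    unsigned long ready_at; /* tick at which the drive raises DRQ */
    unsigned long stall;    /* ticks lost after every status read */
};

static unsigned char sim_status(void *ctx)
{
    struct sim *s = ctx;
    unsigned char stat = (s->ticks >= s->ready_at) ? DRQ_STAT : BUSY_STAT;

    s->ticks += s->stall;   /* pretend an interrupt ran right after the read */
    return stat;
}

static unsigned long sim_jiffies(void *ctx)
{
    return ((struct sim *)ctx)->ticks;
}
```

With a simulated drive that is ready at tick 2 and a stall of 10 ticks after each read, a 5-tick wait fails without the extra check and succeeds with it. A drive that never raises DRQ still times out in both cases.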

The IDE standard gives a timeout for the busy wait of 20 ms, which should not be exceeded,
and the documentation from SanDisk (the CF card is from SanDisk) claims to conform to this.

If anybody has any other suggestions or tests, I can try them out on Monday when I am back
at work.

Regards

Daniel Tram Lux

