RE: L68K: Re: IDE-Driver Update :: Testing Requested

Geert Uytterhoeven (Geert.Uytterhoeven@cs.kuleuven.ac.be)
Tue, 25 Aug 1998 16:18:15 +0200 (CEST)


[ Moving the discussion to linux-kernel.

Recapitulate: when copying from hda to hdb, I get `irq timeout', followed by
a cascade of `ide0: reset: success' / `irq timeout' messages and a complete
system lock up. I'm not talking about large data streams, a few MB is
sufficient.

Latest findings: a timer bug? ]

On Mon, 24 Aug 1998, Michael Schmitz wrote:
> >I'm doing the following in ide_ack_intr():
> >
> > - check if the interrupt was for the IDE interface (Amiga has shared
> > interrupts)
> > - printk("*") to show when an IDE interrupt arrives
> > - wait until the IDE device is no longer busy (cfr. code above). This loop
> > is never entered.
>
> When copying from hda to hdb, I get
>
> >| ...
> >| ******hdb: irq timeout: status=0xd0 { Busy }
> >| ide0: reset: success
> >| hdb: irq timeout: status=0xd0* { Busy }
> ^
> How is this possible ?? Interrupt from IDE while in timeout processing?
>
> >| ide0: reset: success
> >| hdb: irq timeout: status=0xd0* { Busy }
> >| end_request: I/O error, dev 03:42 (hdb), sector 2
> >| hdb: irq timeout: status=0xd0 { Busy }
> >| ide0: reset: success
> >| hdb: irq timeout: status=0xd0* { Busy }
> >| ide0: reset: success
> >| hdb: irq timeout: status=0xd0* { Busy }
> >| ...
>
> Seems like it's getting the interrupt right after the timeout is reported in
> two out of three times.
>
> >The first `hdb: irq timeout' appears within a fraction of a second after the
> >last `*'. HOW IS THAT POSSIBLE?!?
>
> Dunno - for which disk was the interrupt (print 'a' or 'b' there)?

Changed. On IDE interrupt, I now print

'X' if hwgroup->drive == NULL
'A' for hda
'B' for hdb
'!' for other drive (should not happen)

> >The only cases when ide_ack_intr() prints a `*' and the timeout wouldn't be
> >reset, are:
> >
> > - unexpected interrupt (but I don't see any messages about this)
>
> Is unexpected_intr verbose? Probably ... but only if the drive status isn't
> OK (or whatever the OK_STAT(stat, READY_STAT, BAD_STAT) means).

Added `U' for unexpected interrupt. `R' if !drive_is_ready().

In set_handler(), I now print `x', `a', `b' or `?' to indicate for which drive
the request was started.

Now I get:

| aAaAaAaAbBbBbBbBbBbBbBbBbBbBbBbBbBbBbBbBbBaAbBbBbBaAbBbBbBbBbBbBbBaAaAaAaA
| bBbBbBbBbBaAaAaAaAaAaAaAaAbhdb: irq timeout: status=0xd0 { Busy }
| bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
| bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
| bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
| bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
| bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
| bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
| bbbbbbbbbbbbbbbbbbbbbbide0: reset: success
| bhdb: irq timeout: status=0xd0BU { Busy }
| bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

So the request times out on hdb, which is indeed the active drive. But the
message `hdb: irq timeout' appears immediately after the `b' from the last
request, while the timeout should be 10 seconds (10*HZ).

Let's verify: print the timeout value that was active, and wait in
ide_timer_expiry() until the drive is ready:

| aAaAaAbBbBbBbBbBbBaAahda: irq timeout: status=0xd0 { Busy }
| timeout was 1000
| waiting for not busy: AUide0: unexpected interrupt, status=0x58, count=1
^^
| .ready
| bhdb: irq timeout: status=0xd0 { Busy }
| timeout was 1000
| waiting for not busy: BU.ready

So the timeout was 1000 jiffies (10 seconds). Soon the real interrupt happens
(`A'), which is of course unexpected because ide_timer_expiry() resets
hwgroup->handler = NULL again.

Add more debugging code, to remember when the last request was posted. I don't
have the log here (forgot to save it after reboot[*]), but here are the facts:

- at jiffies = 18779, set_handler() calls add_timer() with an expiration
time of 19779 (i.e. jiffies+10*HZ)
- ide_timer_expiry() is called at jiffies = 18779 (i.e. immediately), while
hwgroup->timer.expires still contains 19779 (i.e. future)

So ide_timer_expiry() is called before the timer expires.

How is this possible?? It's driving me nuts, after 1.5 day of debugging,
fscking, ...

I'm using Andre's latest IDE patch for 2.1.117, but the relevant code didn't
change. Besides, we've been seeing the symptoms since months (years?).
The rest of my kernel is Linux/m68k 2.1.115.

Could it be m68k related?

BTW, I also found a possible NULL pointer dereference bug:

In drivers/block/ide.c:ide_error()

err = ide_dump_status(drive, msg, stat);
if (drive == NULL || (rq = HWGROUP(drive)->rq) == NULL)
return;

but ide_dump_status() dumps drive->name without checking for drive being a NULL
pointer.

Greetings,

Geert

[*] Yes, on Amiga you say `debug=mem' and the last 128K of kernel messages are
stored in a piece of unused Chip RAM. I like it! :-)

--
Geert Uytterhoeven                     Geert.Uytterhoeven@cs.kuleuven.ac.be
Wavelets, Linux/{m68k~Amiga,PPC~CHRP}  http://www.cs.kuleuven.ac.be/~geert/
Department of Computer Science -- Katholieke Universiteit Leuven -- Belgium

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.altern.org/andrebalsa/doc/lkml-faq.html