I think reading the IDE status register clears the interrupt in the IDE
device, which might be causing the drive to think it's OK to generate
another interrupt. That could either leave the kernel stuck trying to
service an interrupt that never gets cleared, as you suggested, or
possibly, when the next IRQ comes in, the IDE IRQ handler gets stuck
waiting for a spinlock that the code you're looking at already owns...?
Perhaps a printk in the IDE IRQ handler would be informative? It
wouldn't help you figure out how it got where it is, but it might help
you figure out why the system is hanging.
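To make the spinlock possibility concrete, here is a userspace toy, not
kernel code. Every name in it (toy_lock, toy_irq_handler,
toy_timeout_path) is made up for illustration, and the spin is bounded
so the demo terminates instead of hanging. It only assumes what I
guessed above: the timeout path holds a lock, and the handler for the
next interrupt wants the same lock on the same CPU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a spinlock; NOT the kernel's spinlock_t. */
struct toy_lock {
    atomic_flag held;
};

bool toy_trylock(struct toy_lock *l)
{
    /* test_and_set returns the previous value: false means we got it. */
    return !atomic_flag_test_and_set(&l->held);
}

void toy_unlock(struct toy_lock *l)
{
    atomic_flag_clear(&l->held);
}

/*
 * Hypothetical IRQ handler that needs the lock. A real handler would
 * spin forever; this one gives up after max_spins so the demo ends.
 * Returns the number of failed attempts before success, or -1 if the
 * lock never became free.
 */
int toy_irq_handler(struct toy_lock *l, int max_spins)
{
    for (int spins = 0; spins < max_spins; spins++) {
        if (toy_trylock(l)) {
            toy_unlock(l);
            return spins;
        }
    }
    return -1;
}

/*
 * Hypothetical timeout path. It takes the lock, and the "interrupt"
 * arrives either while the lock is still held or after it is dropped.
 */
int toy_timeout_path(bool irq_while_locked)
{
    struct toy_lock lock = { .held = ATOMIC_FLAG_INIT };
    int result = 0;

    bool got_it = toy_trylock(&lock);
    assert(got_it);

    if (irq_while_locked)
        result = toy_irq_handler(&lock, 1000);  /* owner cannot run: stuck */

    toy_unlock(&lock);

    if (!irq_while_locked)
        result = toy_irq_handler(&lock, 1000);  /* lock is free: fine */

    return result;
}
```

toy_timeout_path(true) gives up with -1, which is the stand-in for a
hard hang, while toy_timeout_path(false) gets the lock on the first
try. Whether the real IDE code can actually get into the first
situation is exactly what the printk should tell you.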
Stuart
-----Original Message-----
From: linux-ide-owner@xxxxxxxxxxxxxxx
[mailto:linux-ide-owner@xxxxxxxxxxxxxxx] On Behalf Of Linas Vepstas
Sent: Monday, June 18, 2007 12:57 PM
To: linux-ide@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
Subject: [BUG] ide dma_timer_expiry, then hard lockup
I've got a hard lockup in the ide subsystem, probably due to some irq
spew or something like that.
I've just bought a brand new Maxtor 320GB disk drive for the insane
price of $70 US to replace another failing drive. It works well under
light load; I was able to copy about 60GB to it. However, under heavy
load, such as reconstruction of an MD RAID-1 array, it'll lock up the
kernel. Which means that my system won't boot :-(
I'm running 2.6.21.1, although the problem seems to occur in 2.6.19 and
2.6.18 too; it's been there a while; I vaguely remember similar problems
in 2.6.5 or 2.6.10.
I get an
"hdc: dma_timer_expiry: dma status == 0x21"
and 10 seconds later,
"hdc: DMA Timeout error"
at which point the system is locked up hard.
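For reference, the 0x21 can be picked apart with a few lines of C. This
is only a sketch: it assumes the value printed by dma_timer_expiry is
the standard bus-master IDE (SFF-8038i) status byte, and the names
bm_status, bm_decode and bm_print are made up for illustration, not
taken from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Bus-master IDE status register bits, per the SFF-8038i layout
 * (assumed to be the byte that dma_timer_expiry prints). */
#define BM_STAT_ACTIVE   0x01  /* DMA transfer still in progress */
#define BM_STAT_ERROR    0x02  /* bus-master error */
#define BM_STAT_INTR     0x04  /* interrupt seen from the drive */
#define BM_STAT_DRV0_DMA 0x20  /* drive 0 marked DMA capable */
#define BM_STAT_DRV1_DMA 0x40  /* drive 1 marked DMA capable */
#define BM_STAT_SIMPLEX  0x80  /* only one channel may DMA at a time */

/* Hypothetical helper type: one flag per documented bit. */
struct bm_status {
    bool active, error, intr, drv0_dma, drv1_dma, simplex;
};

/* Split the raw status byte into its individual flags. */
struct bm_status bm_decode(uint8_t v)
{
    struct bm_status s = {
        .active   = (v & BM_STAT_ACTIVE)   != 0,
        .error    = (v & BM_STAT_ERROR)    != 0,
        .intr     = (v & BM_STAT_INTR)     != 0,
        .drv0_dma = (v & BM_STAT_DRV0_DMA) != 0,
        .drv1_dma = (v & BM_STAT_DRV1_DMA) != 0,
        .simplex  = (v & BM_STAT_SIMPLEX)  != 0,
    };
    return s;
}

/* Print the decoded flags on one line. */
void bm_print(uint8_t v)
{
    struct bm_status s = bm_decode(v);

    printf("dma status 0x%02x: active=%d error=%d intr=%d "
           "drv0_dma=%d drv1_dma=%d simplex=%d\n",
           v, s.active, s.error, s.intr,
           s.drv0_dma, s.drv1_dma, s.simplex);
}
```

Under that layout, bm_decode(0x21) has only the active and drive-0
DMA-capable bits set, with the error and interrupt bits clear. If the
assumption holds, the controller thought the transfer was still in
flight and had not seen an interrupt from the drive when the timer
expired.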
Magic SysRq does not work at all. The hard drive activity light stays
fully lit. Inserting printks into the kernel, I find the hang to be in
a surprising place:
ide_dma_timeout_retry() in ide-io.c prints the "hdc: DMA Timeout error"
message, then calls
    HWIF(drive)->ide_dma_end(drive);
which returns, and then calls
    hwif->INB(IDE_STATUS_REG)
whose result is needed as an argument to ide_error().
But this hangs! -- The INB never returns.
Now, in ide-iops.c:
    hwif->INB = ide_inb;
So putting a printk into ide_inb() shows that the printk before the
readb() is printed, and the printk after the readb() is not (!!)
I find this rather surprising, as I can't imagine how the readb can
fail. My current vague theory is that doing this readb makes the hard
drive go really nuts, and it probably ties some interrupt line high,
and so the Linux kernel gets stuck trying to handle the irq flood. I
just don't know enough about the i386 architecture, or about
interrupts, to prove or disprove this.
Any suggestions, experiments, experimental patches, data gathering,
etc. are welcome. The sooner, the better...
--linas