Re: IDE DMA errors, massive disk corruption: Why? Fixed Yet? Why not re-do failed op?

From: Bartlomiej Zolnierkiewicz
Date: Mon Oct 06 2003 - 14:10:02 EST



There are several different kinds of IDE DMA errors.
Please post the exact error messages, your dmesg output and your .config.

On Monday 06 of October 2003 20:42, Daniel B. wrote:
> I just got bitten _again_ by IDE DMA timeout errors and massive
> filesystem corruption in kernel 2.4.22 (on an Asus A7M266-D dual-Athlon
> XP motherboard (AMD 768 chip / amd7441 IDE controller)).
>
> (I had turned DMA off in my init scripts, but apparently Debian
> unstable's k7-smp configuration enables DMA by default before my init
> scripts get control. Ext3 journal "recovery" trashed my system
> partition.)
>
> What's going on with the IDE DMA bugs? They have existed since 2.2
> (right?), and even at .22 in the 2.4 series they still exist. Why
> have they been around so long? Is it that few kernel developers use
> the combinations of hardware or configuration options that expose
> the bugs (like my dual-CPU box with IDE, not SCSI, disks)?

Well, yes. I have no problems, for example :-).

> Are the DMA bugs believed to be fixed (for real) yet? If so, in which
> version?
>
> Is there any consolidated documentation of the combinations of factors
> that cause corruption, or of how to reliably avoid corruption (like
> all the things to check to make sure your kernel never even tries to
> enable DMA)?
>
>
> Also, why does a DMA timeout cause such corruption? Doesn't the kernel
> keep track of uncompleted operations, retain the information needed to
> try again, and try again if there's a failure? If not, why not?
>
> If it can't try again, shouldn't the kernel at least abort after one
> disk-write failure instead of performing additional writes, which
> frequently depend on the previous writes? (E.g., if I try to read
> block 1's data and write it to block 2, and then write something new
> to block 1, if the first write fails but I continue and do the second
> write, data gets destroyed. If the first write fails and I stop right
> away, less is destroyed.)

Are you sure you don't have a faulty drive?

--bartlomiej
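
The block 1 / block 2 scenario in the quoted message can be made concrete
with a small simulation. This is only a sketch under stated assumptions:
FakeDisk, WriteError and move_then_overwrite are hypothetical names invented
for illustration, and nothing here reflects how the kernel's IDE layer
actually handles a DMA timeout. It only compares "continue after a failed
write" with "stop at the first failed write".

```python
class WriteError(Exception):
    """Raised by the simulated disk when a write does not complete."""


class FakeDisk:
    """Hypothetical in-memory disk; writes to blocks listed in fail_on fail."""

    def __init__(self, blocks, fail_on=()):
        self.blocks = dict(blocks)
        self.fail_on = set(fail_on)

    def read(self, n):
        return self.blocks[n]

    def write(self, n, data):
        if n in self.fail_on:
            raise WriteError(f"write to block {n} timed out")
        self.blocks[n] = data


def move_then_overwrite(disk, new_data, stop_on_error):
    """Copy block 1 to block 2, then overwrite block 1 with new_data.

    Returns True only if every write succeeded.
    """
    steps = [
        lambda: disk.write(2, disk.read(1)),  # first write: save the old data
        lambda: disk.write(1, new_data),      # second write: depends on the first
    ]
    ok = True
    for step in steps:
        try:
            step()
        except WriteError:
            ok = False
            if stop_on_error:
                break  # do not issue writes that depend on the failed one
    return ok


def run(stop_on_error):
    # Assumption for the demo: every write to block 2 fails.
    disk = FakeDisk({1: "old", 2: "empty"}, fail_on={2})
    move_then_overwrite(disk, "new", stop_on_error)
    return disk.blocks


continuing = run(stop_on_error=False)
stopping = run(stop_on_error=True)
print(continuing)  # {1: 'new', 2: 'empty'}
print(stopping)    # {1: 'old', 2: 'empty'}
```

When the simulation carries on after the failure, the string "old" is no
longer stored in any block. When it stops at the first failure, block 1
still holds it.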
