Re: strange ext3 corruption problem on 2.6.x

From: Thorild Selen
Date: Mon Mar 15 2004 - 18:31:03 EST


Marc Lehmann <pcg@xxxxxxxxxx> writes:

> On Mon, Mar 15, 2004 at 08:59:29AM +1030, jpearson@xxxxxxxxxxxxxxxxxxx wrote:
>> 'r/o' by the RAID layer, presumably unbeknownst to VFS; are you
>> *sure* that your array is still up and 'good' when you get this
>> message?
>
> As I said, there are no other messages, so if there is a problem (cabling,
> disk-i/o etc.), then the kernel doesn't know it either (usually the kernel
> is quite loud in this condition).

I was able to repeat this (although with a somewhat different error
message) on a Xeon machine (HT, so "almost" SMP) running 2.6.3 (with
some IPv6 and NFS related patches, most likely nothing affecting
LVM/md/ext3). It took a few hours running bonnie++ on an ext3 fs on
LVM atop raid5 (four Hitachi SATA disks on a Promise SATA150 TX4
controller) until the machine got problems.

The kernel log says:

Mar 15 06:34:40 Psilocybe kernel: EXT3-fs error (device dm-2):
ext3_readdir: bad entry in directory #11: rec_len %% 4 != 0 -
offset=0, inode=1061109567, rec_len=16191, name_len=63
Mar 15 06:34:40 Psilocybe kernel: Aborting journal on device dm-2.
Mar 15 06:34:40 Psilocybe kernel: EXT3-fs error (device dm-2) in
ext3_ordered_writepage: IO failure
Mar 15 06:34:40 Psilocybe kernel: EXT3-fs error (device dm-2) in
ext3_ordered_writepage: IO failure
Mar 15 06:34:41 Psilocybe kernel: ext3_abort called.
Mar 15 06:34:41 Psilocybe kernel: EXT3-fs abort (device dm-2):
ext3_journal_start: Detected aborted journal
Mar 15 06:34:41 Psilocybe kernel: Remounting filesystem read-only
Mar 15 06:34:42 Psilocybe kernel: EXT3-fs error (device dm-2) in
start_transaction: Journal has aborted
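
One observation on the numbers in the first error line, offered as a
sketch rather than a diagnosis: all three values are what you get when
every byte of the directory entry header is 0x3F (ASCII '?'). The
snippet below decodes eight such bytes using the ext2/ext3 on-disk
directory entry layout (u32 inode, u16 rec_len, u8 name_len, u8
file_type, little-endian) and compares the result with the log. The
input bytes are an assumption made for illustration; they were not read
from the affected disk, and the helper name parse_dirent_header is
hypothetical.

```python
import struct

# Values copied from the ext3_readdir error line above.
LOGGED = {"inode": 1061109567, "rec_len": 16191, "name_len": 63}


def parse_dirent_header(raw: bytes) -> dict:
    """Decode the fixed 8-byte header of an ext2/ext3 directory entry.

    Layout (little-endian): u32 inode, u16 rec_len, u8 name_len, u8 file_type.
    """
    inode, rec_len, name_len, file_type = struct.unpack_from("<IHBB", raw, 0)
    return {"inode": inode, "rec_len": rec_len,
            "name_len": name_len, "file_type": file_type}


# Assumed input: eight bytes of 0x3F ('?'), not data taken from the disk.
header = parse_dirent_header(b"\x3f" * 8)

# Do the decoded fields equal the ones the kernel logged?
matches = all(header[key] == value for key, value in LOGGED.items())

# The sanity check that produced the "rec_len % 4 != 0" message.
misaligned = header["rec_len"] % 4 != 0

print(hex(LOGGED["inode"]), hex(LOGGED["rec_len"]), hex(LOGGED["name_len"]))
print("matches log:", matches, "| rec_len % 4 != 0:", misaligned)
```

If that reading is right, the start of the directory block was
overwritten with a repeated byte pattern rather than with a slightly
damaged directory entry, although this alone does not show which layer
wrote it.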

And the last words from bonnie++ were:

Writing intelligently...Can't write block.
Bonnie: drastic I/O error (write(2)): No such file or directory

Then bonnie exited. It seems like something unrelated to this fs was
in an inconsistent state at this stage, as the machine crashed some
hours later.

The last syslog entry before the crash was at 11:05:01, then the
machine crashed quietly. The console was blank; the machine still
responded to pings, but appeared otherwise dead. The arrays were not
clean and were reconstructed at boot, including arrays that were not
involved in running the benchmark.

I can try to repeat the experiment with another fs if that is desired,
but people seem to agree already that the problem is in ext3. Any
suggestions on how to continue?


Thorild Selén
Datorföreningen Update / Update Computer Club, Uppsala, SE