PROBLEM: Filesystem corruption with 2.4.1-pre3 and raid5

From: Holger Kiehl (Holger.Kiehl@dwd.de)
Date: Sun Jan 14 2001 - 17:07:04 EST


Hello

Doing some test where lots of small files get copied (and some large
ones) around, I experienced filesystem corruption with 2.4.1-pre3.

The system has a ASUS P2B-DS (onboard adaptec controller) with two P2-350,
256MB (one module) PC-100 222 SDRAM with ECC, with 4 SCSI disk and one IDE
disk put together as one big SW Raid5 disk, SuSE 6.4 with the following:
    Linux cube 2.4.1-pre3 #3 SMP Sun Jan 14 14:19:02 CET 2001 i686 unknown
    Kernel modules 2.3.24
    Gnu C 2.95.2
    Gnu Make 3.78.1
    Binutils 2.9.5.0.24
    Linux C Library x 1 root root 4061504 Mar 11 2000 /lib/libc.so.6
    Dynamic linker ldd (GNU libc) 2.1.3
    Procps 2.0.6
    Mount 2.10r
    Net-tools 1.54
    Kbd 0.99
    Sh-utils 2.0
    Modules Loaded

I know my modutilities are not up to date, but all relevant things (SCSI,
filesystem, raid) where compiled in.
Here are some messages from syslog:

    Jan 14 18:50:00 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (613512), 0
    Jan 14 18:56:19 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (613533), 0
    Jan 14 18:56:20 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (613510), 0
    Jan 14 18:57:14 cube kernel: attempt to access beyond end of device
    Jan 14 18:57:14 cube kernel: 09:01: rw=1, want=1753106892, limit=8449536
    Jan 14 18:57:14 cube kernel: attempt to access beyond end of device
    Jan 14 18:57:14 cube kernel: 09:01: rw=1, want=1635361196, limit=8449536
        .
        .
        .
    Jan 14 18:57:14 cube kernel: attempt to access beyond end of device
    Jan 14 18:57:14 cube kernel: 09:01: rw=1, want=127799040, limit=8449536
    Jan 14 18:57:14 cube kernel: attempt to access beyond end of device
    Jan 14 18:57:14 cube kernel: 09:01: rw=1, want=1004451972, limit=8449536
    Jan 14 19:09:05 cube -- MARK --
    Jan 14 19:29:05 cube -- MARK --
    Jan 14 19:32:55 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (145947), 0
    Jan 14 19:32:55 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (145948), 0
    Jan 14 19:32:55 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (145949), 0
        .
        .
        .
    Jan 14 19:33:18 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (145945), 0
    Jan 14 19:33:18 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (145946), 0
    Jan 14 19:49:06 cube -- MARK --
    Jan 14 19:53:36 cube kernel: __alloc_pages: 2-order allocation failed.
    Jan 14 19:53:39 cube last message repeated 8 times
    Jan 14 20:09:06 cube -- MARK --
    Jan 14 20:10:52 cube kernel: EXT2-fs error (device md(9,1)): ext2_readdir: bad entry in directory #929061: rec_len is smaller than minimal - offset=4056, inode=0, rec_len=0, name_len=0
    Jan 14 20:10:52 cube kernel: EXT2-fs error (device md(9,1)): empty_dir: bad entry in directory #929061: rec_len is smaller than minimal - offset=4056, inode=0, rec_len=0, name_len=0
    Jan 14 20:30:20 cube -- MARK --
    Jan 14 20:50:24 cube -- MARK --
    Jan 14 21:10:06 cube kernel: EXT2-fs error (device md(9,1)): ext2_free_blocks: bit already cleared for block 1402395
    Jan 14 21:10:06 cube kernel: EXT2-fs error (device md(9,1)): ext2_free_blocks: bit already cleared for block 1438368
    Jan 14 21:11:57 cube kernel: EXT2-fs error (device md(9,1)): ext2_free_blocks: bit already cleared for block 1439021
    Jan 14 21:11:57 cube kernel: EXT2-fs error (device md(9,1)): ext2_free_blocks: bit already cleared for block 1435690
    Jan 14 21:27:01 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (698429), 0
        .
        .
        .
    Jan 14 21:27:03 cube kernel: EXT2-fs warning (device md(9,1)): ext2_unlink: Deleting nonexistent file (698429), 0
    Jan 14 21:30:02 cube nscd: 175: cannot stat() file `/etc/group': No such file or directory
    Jan 14 21:35:38 cube /usr/sbin/gpm[113]: oops() invoked from gpm.c(508)
    Jan 14 21:35:38 cube /usr/sbin/gpm[113]: get_shift_state: Inappropriate ioctl for device

At this point I could still log into the system.
I noticed after killing all process with SysRQ+i that something (I assume
the kernel) was eating my memory:

    ps aux

    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    root 1 0.0 0.0 344 200 ? S 14:48 0:09 init
    root 2 0.0 0.0 0 0 ? SW 14:48 0:00 [keventd]
    root 4 0.0 0.0 0 0 ? SW 14:48 0:23 [kswapd]
    root 5 0.0 0.0 0 0 ? SW 14:48 0:03 [kreclaimd]
    root 6 0.7 0.0 0 0 ? SW 14:48 2:59 [bdflush]
    root 7 0.3 0.0 0 0 ? SW 14:48 1:19 [kupdate]
    root 8 0.0 0.0 0 0 ? SW< 14:48 0:00 [mdrecoveryd]
    root 9 2.2 0.0 0 0 ? SW< 14:48 9:16 [raid5d]
    root 10 0.0 0.0 0 0 ? SW< 14:48 0:00 [raid1d]
    root 11847 0.0 0.2 1160 524 tty6 S 21:38 0:00 /sbin/mingetty tty6
    root 11848 0.0 0.2 1160 528 tty5 S 21:38 0:00 /sbin/mingetty tty5
    root 11854 0.0 0.2 1160 524 tty3 S 21:38 0:00 /sbin/mingetty tty3
    root 11855 0.0 0.2 1160 524 tty4 S 21:38 0:00 /sbin/mingetty tty4
    root 11856 0.0 0.4 1804 1112 tty1 S 21:38 0:00 login -- root
    root 11857 0.0 0.2 1160 524 tty2 S 21:38 0:00 /sbin/mingetty tty2
    root 11858 0.2 0.5 2164 1316 tty1 S 21:39 0:00 -bash
    root 11867 0.0 0.4 2760 1156 tty1 R 21:39 0:00 ps aux

    free

    total used free shared buffers cached
    Mem: 255284 226000 29284 0 17828 117788
    -/+ buffers/cache: 90384 164900
    Swap: 267228 540 266688

One ps even just dumped core and I still have the core file. Don't know if
this is of help:

    afdbench@cube:~$ gdb /bin/ps core.ps
    GNU gdb 4.18
    Copyright 1998 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB. Type "show warranty" for details.
    This GDB was configured as "i386-suse-linux"...(no debugging symbols found)...
    Core was generated by `ps x'.
    Program terminated with signal 11, Segmentation fault.
    Reading symbols from /lib/libc.so.6...done.
    Reading symbols from /lib/ld-linux.so.2...done.
    #0 0x80525ac in strcpy () at ../sysdeps/generic/strcpy.c:43
    43 ../sysdeps/generic/strcpy.c: No such file or directory.
    (gdb) where
    #0 0x80525ac in strcpy () at ../sysdeps/generic/strcpy.c:43
    #1 0x81490e0 in __ctype_b ()
    #2 0x8052f28 in strcpy () at ../sysdeps/generic/strcpy.c:43
    #3 0x8050472 in strcpy () at ../sysdeps/generic/strcpy.c:43
    #4 0x80509a5 in strcpy () at ../sysdeps/generic/strcpy.c:43
    #5 0x40034a5e in __libc_start_main () at ../sysdeps/generic/libc-start.c:93

I have this system now for several years and it has always been very
stabel under 2.2.x. In fact this is my first filesystem corruption.

If I forget anything or more information is required please tell me.

Holger

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Mon Jan 15 2001 - 21:00:39 EST