Repeated Oops with 2.2.14 & knfsd

From: Jens Benecke (jens@pinguin.conetix.de)
Date: Tue Feb 22 2000 - 11:12:48 EST


Hi everyone,

before all else: How do I really thoroughly test for faulty memory? I am
beginning to suspect it's the hardware: NO Linux system has ever been as
unstable over so many (>20) different kernel revisions as my own.

This is my old P100 (Asus mainboard, 48MB RAM, 10G harddisk) that I use as
an NFS server for /home and /usr/local. The knfsd part has been patched to
circumvent the mandatory access problem for KDE applications, like this:

# fs/nfsd/vfs.c
#define IS_ISMNDLK(i) (IS_MANDLOCK((i)) && \
        (((i)->i_mode & (S_ISGID|S_IXGRP|S_IFMT)) == (S_ISGID|S_IFREG)))

This is the system:

        Linux pinguin 2.2.14 #2 Sam Jan 29 21:05:21 CET 2000 i586 unknown
        Kernel modules 2.3.9
        Gnu C 2.95.2
        Binutils 2.9.5.0.22
        Linux C Library 2.1.3
        Dynamic linker ldd: version 1.9.11
        Procps 2.0.6
        Mount 2.10f
        Net-tools 2.05
        Kbd 0.99
        Sh-utils 2.0
        Modules Loaded ip_masq_vdolive ip_masq_user ip_masq_quake
        ip_masq_irc ip_masq_raudio ip_masq_ftp nfsd parport_probe parport_pc lp
        parport lockd sunrpc 3c509 ne2k-pci 8390 old_tulip serial unix

I had some problems with the new tulip driver losing connections after a
period of inactivity, so I went for the old_tulip. Newly installed, it ran
without problems for 14 days, then came the first Oops. This apparently
happened with a nohup'ed wget task that I left running overnight.

kernel: Oops: 0000
kernel: CPU: 0
kernel: EIP: 0010:[find_buffer+106/140]
kernel: EFLAGS: 00010206
kernel: eax: 00200000 ebx: 00000004 ecx: 0000bab4 edx: 00200000
kernel: esi: 0000000a edi: 00000307 ebp: 00249201 esp: c1fd9da0
kernel: ds: 0018 es: 0018 ss: 0018
kernel: Process wget (pid: 18195, process nr: 121, stackpage=c1fd9000)
kernel: Stack: 00000400 00000307 0000bab4 c012539f 00000307 00249201 00000400 c01256dc
kernel: 00000307 00249201 00000400 c22f4800 c1d5ac00 c22f4800 c1fd9e40 c013905e
kernel: c013916c 00000307 00249201 00000400 00249201 c1fd9f18 c0789320 00000307
kernel: Call Trace: [get_hash_table+23/36] [getblk+32/328] [ext2_new_block+1626/2276] [ext2_new_block+1896/2276]
        [get_hash_table+23/36] [ext2_alloc_block+328/340] [block_getblk+323/624]
kernel: [ext2_getblk+361/516] [ext2_file_write+562/1424] [sock_recvmsg+62/176] [sock_read+132/148] [sys_write+191/224] [system_call+52/56]
kernel: Code: 8b 12 39 68 04 75 f3 8b 4c 24 20 39 48 08 75 ea 66 39 78 0c

After this, every task started on my box hung (I could see that because in
the morning I had a hundred "fetchmail", "cron" and "run-parts" tasks in "D"
or "SW" state), and nothing really worked any more. I rebooted, and

kernel: Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
kernel: Oops: 0000
kernel: CPU: 0
kernel: EIP: 0010:[find_buffer+106/140]
kernel: EFLAGS: 00010206
kernel: eax: 00200000 ebx: 00000004 ecx: 0000bab4 edx: 00200000
kernel: esi: 0000000a edi: 00000307 ebp: 0030808b esp: c0c11cfc
kernel: ds: 0018 es: 0018 ss: 0018
kernel: Process quotacheck (pid: 284, process nr: 37, stackpage=c0c11000)
kernel: Stack: 00000400 00000307 0000bab4 c012539f 00000307 0030808b 00000400 c01256dc
kernel: 00000307 0030808b 00000400 00000006 c0c11e9c 00000000 00000000 00000307
kernel: c012858f 00000307 0030808b 00000400 c0b1a4e0 ffffffea 00000000 00002000
kernel: Call Trace: [get_hash_table+23/36] [getblk+32/328]
        [block_read+659/1196] [add_request+276/624] [make_request+1386/1416]
        [make_request+543/1416] [do_anonymous_page+103/116]
kernel: [do_no_page+48/192] [handle_mm_fault+193/304]
        [fn_hash_lookup+129/204] [get_fast_time+12/16] [netif_rx+20/160]
        [3c509:__insmod_3c509_O/lib/modules/2.2.14/net/3c509.o_M38934E33_V+-17919/76]
        [do_anonymous_page+103/116] [do_no_page+48/192]
kernel: [update_process_times+91/100] [timer_bh+204/880]
        [sys_llseek+155/268] [sys_read+174/196] [system_call+52/56]
kernel: Code: 8b 12 39 68 04 75 f3 8b 4c 24 20 39 48 08 75 ea 66 39 78 0c

which I didn't notice until I tried to re-mount my NFS:

kernel: Oops: 0000
kernel: CPU: 0
kernel: EIP: 0010:[find_buffer+106/140]
kernel: EFLAGS: 00010206
kernel: eax: 00200000 ebx: 00000004 ecx: 0000bab4 edx: 00200000
kernel: esi: 0000000a edi: 00000307 ebp: 00288387 esp: c1053e8c
kernel: ds: 0018 es: 0018 ss: 0018
kernel: Process nfsd (pid: 273, process nr: 43, stackpage=c1053000)
kernel: Stack: 00288387 c15de988 0000bab4 c013f9f1 00000307 00288387 00000400 00000029
kernel: 00000100 c0f8f860 c0868880 00000002 00000100 0000000a 0028837d 00000000
kernel: 00000062 c013fc3c c0868880 00002a0c c1601ca4 c0f8f860 00000000 c0868950
kernel: Call Trace: [trunc_indirect+381/672] [trunc_dindirect+296/344]
        [ext2_truncate+255/452] [old_tulip:tulip_debug+192517/43712379]
        [old_tulip:tulip_debug+192607/43712289]
        [old_tulip:tulip_debug+183967/43720929]
        [old_tulip:tulip_debug+219896/43685000]
kernel: [old_tulip:tulip_debug+182287/43722609]
        [old_tulip:tulip_debug+219896/43685000]
        [old_tulip:tulip_debug+63063/43841833]
        [old_tulip:tulip_debug+219748/43685148]
        [old_tulip:tulip_debug+181906/43722990] [kernel_thread+40/56]
kernel: Code: 8b 12 39 68 04 75 f3 8b 4c 24 20 39 48 08 75 ea 66 39 78 0c

Most of the other programs kept running, though. I was able to ssh in and
reboot, although unmounting didn't work; it got stuck before that.

I rebooted, re-mounted my NFS, and on first access

kernel: Oops: 0000
kernel: CPU: 0
kernel: EIP: 0010:[find_buffer+106/140]
kernel: EFLAGS: 00010206
kernel: eax: 00200000 ebx: 00000004 ecx: 0000bab4 edx: 00200000
kernel: esi: 0000000a edi: 00000307 ebp: 00288387 esp: c108fe8c
kernel: ds: 0018 es: 0018 ss: 0018
kernel: Process nfsd (pid: 253, process nr: 42, stackpage=c108f000)
kernel: Stack: 00288387 c2b89588 0000bab4 c013f9f1 00000307 00288387 00000400 00000029
kernel: 00000100 c052ce60 c2856ee0 00000002 00000100 00000000 00000000 00000000
kernel: 00000062 c013fc3c c2856ee0 00002a0c c2b890a4 c052ce60 00000000 c2856fb0
kernel: Call Trace: [trunc_indirect+381/672] [trunc_dindirect+296/344]
        [ext2_truncate+255/452] [old_tulip:tulip_debug+192517/43712379]
        [old_tulip:tulip_debug+192607/43712289]
        [old_tulip:tulip_debug+183967/43720929]
        [old_tulip:tulip_debug+219896/43685000]
kernel: [old_tulip:tulip_debug+182287/43722609]
        [old_tulip:tulip_debug+219896/43685000]
        [old_tulip:tulip_debug+63063/43841833]
        [old_tulip:tulip_debug+219748/43685148]
        [old_tulip:tulip_debug+181906/43722990] [kernel_thread+40/56]
kernel: Code: 8b 12 39 68 04 75 f3 8b 4c 24 20 39 48 08 75 ea 66 39 78 0c

Now I inserted the new tulip and for the last 15 minutes it's been running
smoothly. But I don't think it will last another 14 days.

I am totally lost. Except for buggy hardware, what could it be - and if
it's the hardware (I suspect RAM if anything), how would I test? The BIOS
doesn't complain, even when I run the "long" RAM test.

Thanks for any help!

-- 
_ciao, Jens_______________________________ http://www.pinguin.conetix.de
"The current logic behind why we do not have a html to text converter [in
Outlook] is the overhead that would be placed on the machine, browser and
email app that would seriously hinder performance."
  -- Microsoft, on "Why it is not possible to disable HTML in Outlook for
     security reasons"
