Kernel EIP and total NFS failure (2.0.33, 2.1.86/9)

Nathan Field (nathan@Lanfear.ST.HMC.Edu)
Wed, 1 Apr 1998 14:09:30 -0800 (PST)


I have a very strange problem on a Beowulf cluster that I'm building.
Essentially, NFS has stopped running on all of the nodes and, as far as
I can tell, it stopped on all of them at the same time. It will not
restart, and gives the following error:
[root@n000 log]# /etc/rc.d/init.d/nfs start
Starting NFS services: rpc.mountd Cannot register service: RPC: Unable to receive; errno = Connection refused
rpc.nfsd Cannot register service: RPC: Unable to receive; errno = Connection refused
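
The "Cannot register service" step is where an RPC daemon tries to talk to the portmapper, which normally listens on port 111. As a rough way to see whether anything is accepting connections there, something like the sketch below could be run on a node. The helper name port_is_open and its defaults are my own, not part of any NFS tool. It only tests TCP, while registration may go over UDP, so a successful connect shows that something is listening and nothing more.

```python
import socket

PORTMAP_PORT = 111  # well-known portmapper (sunrpc) port


def port_is_open(host, port=PORTMAP_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper for illustration only: a refused or timed-out
    connection returns False, anything that accepts returns True.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable hosts.
        return False
```

Calling port_is_open("localhost") on a node, and port_is_open("hrothgar") from a node, would at least separate "nothing is listening locally" from "the frontend is not reachable on that port".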

I noticed that one of the nodes had the following kernel EIP. It looks
like NFS died in the middle of some NFS operation:

Mar 29 02:02:05 n000 kernel: general protection: 0000
Mar 29 02:02:05 n000 kernel: CPU: 0
Mar 29 02:02:05 n000 kernel: EIP: 0010:[get_empty_inode+68/352]
Mar 29 02:02:05 n000 kernel: EFLAGS: 00010202
Mar 29 02:02:05 n000 kernel: eax: 00000c00 ebx: 031534ec ecx: 734191f8 edx: 00000005
Mar 29 02:02:05 n000 kernel: esi: 000001f8 edi: 00000001 ebp: 001f1148 esp: 0162fec8
Mar 29 02:02:05 n000 kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Mar 29 02:02:05 n000 kernel: Process gawk (pid: 5696, process nr: 27, stackpage=0162f000)
Mar 29 02:02:05 n000 kernel: Stack: 00000000 001e371c 00000000 00123661 0284bccc 020de118 0000000e 01c9d002
Mar 29 02:02:05 n000 kernel: 020de118 0015addd 001f1148 00014137 00000001 0284bccc 00000001 0162ff60
Mar 29 02:02:05 n000 kernel: 0000000e 02106530 00014137 0012a689 0284bccc 01c9d002 0000000e 0162ff60
Mar 29 02:02:05 n000 kernel: Call Trace: [__iget+97/516] [ext2_lookup+341/364] [lookup+221/244] [open_namei+516/1032] [do_open+87/284] [sys_open+57/112] [system_call+85/124]
Mar 29 02:02:05 n000 kernel: Code: 66 83 b9 80 00 00 00 00 75 26 ba e7 03 00 00 8a 81 84 00 00
Mar 29 02:02:05 n000 kernel: general protection: 0000
Mar 29 02:02:05 n000 kernel: CPU: 0
Mar 29 02:02:05 n000 kernel: EIP: 0010:[locks_remove_locks+7/56]
Mar 29 02:02:05 n000 kernel: EFLAGS: 00010286
Mar 29 02:02:05 n000 kernel: eax: f000ef6f ebx: 00000001 ecx: 00000000 edx: 00000000
Mar 29 02:02:05 n000 kernel: esi: 00000000 edi: f000ef6f ebp: 00636018 esp: 0162fe10
Mar 29 02:02:05 n000 kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Mar 29 02:02:05 n000 kernel: Process gawk (pid: 5696, process nr: 27, stackpage=0162f000)
Mar 29 02:02:05 n000 kernel: Stack: 001221cf 0063e018 00000000 00000001 00000003 00000001 001166f9 00000000
Mar 29 02:02:05 n000 kernel: 0000002b 00000014 01630000 0162fe8c 0010ad14 0000000b 001b56e4 00000000
Mar 29 02:02:05 n000 kernel: 000001f8 00000001 001f1148 00000000 05000000 04800000 02250018 0010b128
Mar 29 02:02:05 n000 kernel: Call Trace: [close_fp+55/92] [do_exit+273/488] [die_if_kernel+676/684] [3c59x+83886080/784039936] [3c59x+75497472/784039936] [do_general_protection+40/84] [do_general_protection+0/84]
Mar 29 02:02:05 n000 kernel: [error_code+64/72] [get_empty_inode+68/352] [__iget+97/516] [ext2_lookup+341/364] [lookup+221/244] [open_namei+516/1032] [do_open+87/284] [sys_open+57/112]
Mar 29 02:02:05 n000 kernel: [system_call+85/124]
Mar 29 02:02:05 n000 kernel: Code: 8b 50 50 85 d2 74 27 f6 42 20 01 74 14 ff 74 24 04 83 c0 50
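
For anyone who wants to compare traces across nodes: as far as I understand it, each bracketed entry above has the form [symbol+offset/size], meaning the offset into the function and the function's length, both in bytes. Here is a small sketch that pulls those entries out of a log line. The function name parse_trace_entries is made up for this example, and the regular expression is based only on the lines shown here, so treat it as an assumption rather than a description of the log format in general.

```python
import re

# Matches entries such as [get_empty_inode+68/352] or [3c59x+83886080/784039936].
TRACE_ENTRY = re.compile(r"\[([A-Za-z0-9_]+)\+(\d+)/(\d+)\]")


def parse_trace_entries(line):
    """Return a list of (symbol, offset, size) tuples found in one log line.

    Hypothetical helper: offset and size are converted to int; lines with
    no bracketed entries give an empty list.
    """
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in TRACE_ENTRY.finditer(line)
    ]
```

Feeding each "EIP:" and "Call Trace:" line through this gives a list that can be sorted or diffed between nodes without the timestamps getting in the way.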

Upon bootup the nodes (running 2.0.33) report the following:
Apr 1 07:47:44 n000 kernel: loading device 'eth0'...
Apr 1 07:47:44 n000 kernel: eth0: 3Com 3c905 Boomerang 100baseTx at 0xd000, 00:60:08:cb:be:11, IRQ 11
Apr 1 07:47:44 n000 kernel: 8K word-wide RAM 3:5 Rx:Tx split, autoselect/MII interface.
Apr 1 07:47:44 n000 kernel: eth0: MII transceiver found at address 24.
Apr 1 07:47:44 n000 kernel: eth0: Overriding PCI latency timer (CFLT) setting of 32, new value is 248.
Apr 1 07:47:45 n000 mountd[259]: unable to register (mountd, 1, udp).
Apr 1 07:47:45 n000 exportfs[268]: syntax error in exports file (line 0): bad option list
Apr 1 07:47:45 n000 exportfs[268]: could not open /var/lib/nfs/xtab for locking
Apr 1 07:47:45 n000 exportfs[268]: could not open /var/lib/nfs/xtab for locking
Apr 1 07:47:45 n000 exportfs[268]: can't lock /var/lib/nfs/xtab for writing
Apr 1 07:47:45 n000 mountd[289]: unable to register (mountd, 1, udp).
Apr 1 07:47:46 n000 nfsd[297]: unable to register (nfsd, 2, udp).
Apr 1 07:47:46 n000 automount[312]: starting automounter version 0.3.14, path = /data, maptype = file, mapname = /etc/autofs.map
Apr 1 07:47:46 n000 automount[312]: using kernel protocol version 3
[snip a bit, now trying to log in, attempts to automount home on hrothgar]
Apr 2 05:40:25 n000 automount[312]: attempting to mount entry /data/hrothgar
Apr 2 05:40:25 n000 automount[445]: >> mount: RPC: Port mapper failure - RPC: Unable to receive
Apr 2 05:40:25 n000 automount[445]: mount(nfs): nfs: mount failure hrothgar:/hrothgar on /data/hrothgar

We are using a 100Mbit Baystack switch. It is the only failure point that
I can see in this case, but I don't see how it could be the cause, since
ping, finger, telnet etc. work just fine.
Hardware in the nodes is (running 2.0.33):
3com905
P II 300
128MB RAM
el cheapo PCI video card
4.2 IDE Quantum HD

Hardware in the frontend machine is (running 2.1.89 SMP):
2 Intel EtherExpress Pro 100s
2 P II 300s, (running on a Tyan Tiger board?)
256MB RAM
4.2 IDE Quantum HD
Matrox Mill II, 8MB
SB 64
DTP SCSI card (2144?, not currently used)

Just to dump even more info, here's something which the main node reports
in the logs:
Mar 7 16:12:24 hrothgar automount[12487]: using kernel protocol version 3
Mar 7 16:13:19 hrothgar automount[12487]: shutting down, path = /data
Mar 7 16:13:39 hrothgar automount[12501]: starting automounter version 0.3.14, path = /data, maptype = file, mapname = /etc/autofs.map
Mar 7 16:13:39 hrothgar automount[12501]: using kernel protocol version 3
Mar 7 16:18:34 hrothgar nfsd[12554]: setsockopt failed: Invalid argument
Mar 7 16:18:34 hrothgar nfsd[12554]: setsockopt failed: Invalid argument

Thanks for reading through all of this. If anyone wants more info, get in
touch with me and I'll be happy to provide it. My deadline for getting
the system operational is approaching and I'm in a bit of a bind :)

nathan

PS: the logs mention a problem writing to a file, xtab. This is because
/var/lib/nfs did not exist. Creating the file/dir didn't help, and NFS was
working without it earlier.

Nathan Field -- Activities: Staff for CS Dept, building a Beowulf
machine for Math Dept. and passing my classes at Harvey
Mudd College in my free time.
For my PGP public key finger nathan@lanfear.st.hmc.edu

Printer not ready, could be a fatal error. Have a pen handy?
