NFS question and NFS GPF PROBLEM (repost)

Michel LESPINASSE (walken@Studio.via.ecp.fr)
19 Jan 1997 22:57:16 GMT


I posted this to the list three days ago and got no response. Apparently
this mail wasn't distributed by vger..... arghhhh.
Well, I'm sorry if someone gets this mail twice.
Also please cc: me in replies: I'm reading the list, but apparently vger
doesn't want to distribute all of it to me :-/

Okay, so here is the repost. The good news is that after writing this,
I removed the rsize=8192 parameter from my fstab and the kernel hasn't
given me any oopses in three days.

-----------------------------------------

Hi,

Can someone explain to me how req->rq_wait is initialised in the NFS
filesystem code? The only place where I can see it being touched is
in nfsiod(), when the whole req structure is memset to zero.

From then on, the only places where I can see rq_wait explicitly used
in the kernel (I looked with a big grep) are as an argument to wake_up or
interruptible_sleep_on....

Probably I'm just missing something <BIG>

In do_read_nfs_async, most of the req structure is initialized, but not
the rq_wait field. God I'm confused.

The reason I'm asking is that I got two almost identical GPFs with
kernel versions 2.0.27 and 2.0.28. Each time I can see that the problem
is in the wake_up procedure, when it tries to execute "struct task_struct
*p = next->task;" and next is not a valid pointer. In fact, in one of my
two dumps I can see next=0xffffffff (in ebx) at execution time. In the
other dump I have next=0x6e692f72 (looks like a random value?)

In both of my GPF dumps, I can see that wake_up is being called by
nfsiod_enqueue. The only reason next can be bad in wake_up is that
req->rq_wait is bad in nfsiod_enqueue. But I cannot find where rq_wait
is initialised.

GPF dump with kernel 2.0.28 :

general protection: 0000
CPU: 0
EIP: 0010:[<00110ad4>]
EFLAGS: 00010286
eax: 01ffae28 ebx: ffffffff ecx: 01ffae28 edx: 0000030b
esi: 01ffae20 edi: 01ffae24 ebp: 012dfe64 esp: 012dfe58
ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Process dpkg-split (pid: 519, process nr: 44, stackpage=012df000)
Stack: 00000074 01ffae20 00730000 0020f980 00167119 01ffae28 00166ef1 01ffae20
       019bc0f8 019bc0f8 01a86ec0 0020f980 00002400 00000000 00000000 001e2810
       017c7058 017c7058 007e0000 00002000 01560a18 00000001 017c7b30 00000003
Call Trace: [<00167119>] [<00166ef1>] [<0011ba76>] [<0011bb4a>] [<00166a08>] [<00121716>] [<0010a5f5>]
Code: 8b 13 8b 5b 04 85 d2 74 6d 8b 02 83 f8 02 74 07 8b 02 83 f8

Today I got a very similar dump with kernel 2.0.27:

general protection: 0000
CPU: 0
EIP: 0010:[<001103e0>]
EFLAGS: 00010216
eax: 00010e28 ebx: 6e692f72 ecx: 00010e28 edx: 0000030b
esi: 00010e20 edi: 00010e24 ebp: 01368e64 esp: 01368e58
ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Process dpkg-split (pid: 3805, process nr: 40, stackpage=01368000)
Stack: 00000074 00010e20 014d0000 0023d9ac 00166a49 00010e28 00166821 00010e20
       0118ac98 0118ac98 012fe7c0 0023d9ac 00002400 00000000 00000000 001e2068
       0142e398 0142e398 01657000 00002000 0160f898 00000001 0142e9b0 00000003
Call Trace: [<00166a49>] [<00166821>] [<0011b382>] [<0011b456>] [<00166338>] [<00120ff2>] [<0010a615>]
Code: 8b 13 8b 5b 04 85 d2 74 6d 8b 02 83 f8 02 74 07 8b 02 83 f8

In both cases the ksymoops output is the same except for the hexadecimal
addresses (I show the 2.0.28 version, with my comments added):

Using `/boot/System.map-2.0.28' to map addresses to symbols.

>>EIP: 110ad4 <wake_up+2c/e4>
Trace: 167119 <nfsiod_enqueue+d/18>
Trace: 166ef1 <nfs_readpage+f5/264>
Trace: 11ba76 <generic_file_read+40e/5b4>
Trace: 11bb4a <generic_file_read+4e2/5b4>
Trace: 166a08 <nfs_file_read+a4/b0>
Trace: 121716 <sys_read+8a/b0>
Trace: 10a5f5 <system_call+55/80>

Code: 110ad4 <wake_up+2c/e4> movl (%ebx),%edx /* p=next->task */
Code: 110ad6 <wake_up+2e/e4> movl 0x4(%ebx),%ebx /* next=next->next */
Code: 110ad9 <wake_up+31/e4> testl %edx,%edx /* if (p!=NULL) */
Code: 110adb <wake_up+33/e4> je 110b4a <wake_up+a2/e4>
Code: 110add <wake_up+35/e4> movl (%edx),%eax /* p->state */
Code: 110adf <wake_up+37/e4> cmpl $0x2,%eax
Code: 110ae2 <wake_up+3a/e4> je 110aeb <wake_up+43/e4>
Code: 110ae4 <wake_up+3c/e4> movl (%edx),%eax /* state again !?! */
Code: 110ae6 <wake_up+3e/e4> cmpl $0x0,%eax
Code: 110ae9 <wake_up+41/e4> nop
Code: 110aea <wake_up+42/e4> nop
Code: 110aeb <wake_up+43/e4> nop

I think that there is a bug in the NFS filesystem code and that my
configuration is one of the few that trigger it. In practice, I can
usually do a lot of NFS accesses until "someday" I get a GPF. After this
GPF, the results vary: sometimes the reading process gets locked and I
cannot kill -9 it (despite the intr flag used at mount time), sometimes
the reading process gets killed. I also once got the "wait_queue is bad"
message while I was running the 2.0.27 kernel. Sometimes my load average
goes up to one, sometimes not.

I would like to determine in which kernel version this problem first
appeared, but this is not easy because the problem usually appears only
after a few days of uptime.

My configuration :
kernels 2.0.27 or 2.0.28
intel p133, 32M
3c509 adapter (3Com Etherlink III)
I also have a Token Ring card inside, but I wasn't using it when I got
these dumps.

My nfs partition is mounted with the following fstab entry :
bouddha:/debian /mnt/debian nfs defaults,nodev,noexec,nosuid,rsize=8192,ro,intr 0 0

The nfs server is running linux 2.0.22 (but this shouldn't be related to
my oops in any case.... at least in theory :)

My .config follows (this version corresponds to my 2.0.28 kernel) :

# Automatically generated make config: don't edit

# Code maturity level options
# CONFIG_EXPERIMENTAL is not set

# Loadable module support
CONFIG_MODULES=y
CONFIG_MODVERSIONS=y
CONFIG_KERNELD=y

# General setup
# CONFIG_MATH_EMULATION is not set
CONFIG_NET=y
# CONFIG_MAX_16M is not set
CONFIG_PCI=y
CONFIG_SYSVIPC=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_ELF=y
CONFIG_KERNEL_ELF=y
# CONFIG_M386 is not set
# CONFIG_M486 is not set
CONFIG_M586=y
# CONFIG_M686 is not set

# Floppy, IDE, and other block devices
CONFIG_BLK_DEV_FD=y
CONFIG_BLK_DEV_IDE=y
# Please see Documentation/ide.txt for help/info on IDE drives
# CONFIG_BLK_DEV_HD_IDE is not set
# CONFIG_BLK_DEV_IDECD is not set
# CONFIG_BLK_DEV_IDETAPE is not set
# CONFIG_BLK_DEV_IDE_PCMCIA is not set
# CONFIG_BLK_DEV_CMD640 is not set
# CONFIG_BLK_DEV_RZ1000 is not set
CONFIG_BLK_DEV_TRITON=y
# CONFIG_IDE_CHIPSETS is not set

# Additional Block Devices
# CONFIG_BLK_DEV_LOOP is not set
# CONFIG_BLK_DEV_MD is not set
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_BLK_DEV_XD is not set
# CONFIG_BLK_DEV_HD is not set

# Networking options
# CONFIG_FIREWALL is not set
# CONFIG_NET_ALIAS is not set
CONFIG_INET=y
# CONFIG_IP_FORWARD is not set
CONFIG_IP_MULTICAST=y
# CONFIG_IP_ACCT is not set
# (it is safe to leave these untouched)
# CONFIG_INET_PCTCP is not set
# CONFIG_INET_RARP is not set
# CONFIG_NO_PATH_MTU_DISCOVERY is not set
CONFIG_IP_NOSR=y
CONFIG_SKB_LARGE=y

# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_AX25 is not set
# CONFIG_NETLINK is not set

# SCSI support
# CONFIG_SCSI is not set

# Network device support
CONFIG_NETDEVICES=y
# CONFIG_DUMMY is not set
# CONFIG_EQUALIZER is not set
# CONFIG_PLIP is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
# CONFIG_NET_RADIO is not set
CONFIG_NET_ETHERNET=y
CONFIG_NET_VENDOR_3COM=y
# CONFIG_EL1 is not set
# CONFIG_EL2 is not set
CONFIG_EL3=y
# CONFIG_VORTEX is not set
# CONFIG_LANCE is not set
# CONFIG_NET_VENDOR_SMC is not set
# CONFIG_NET_ISA is not set
# CONFIG_NET_EISA is not set
# CONFIG_NET_POCKET is not set
CONFIG_TR=y
CONFIG_IBMTR=y
# CONFIG_FDDI is not set
# CONFIG_ARCNET is not set

# ISDN subsystem
# CONFIG_ISDN is not set

# CD-ROM drivers (not for SCSI or IDE/ATAPI drives)
# CONFIG_CD_NO_IDESCSI is not set

# Filesystems
# CONFIG_QUOTA is not set
# CONFIG_LOCK_MANDATORY is not set
CONFIG_MINIX_FS=y
# CONFIG_EXT_FS is not set
CONFIG_EXT2_FS=y
# CONFIG_XIA_FS is not set
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
# CONFIG_VFAT_FS is not set
# CONFIG_UMSDOS_FS is not set
CONFIG_PROC_FS=y
CONFIG_NFS_FS=y
# CONFIG_ROOT_NFS is not set
CONFIG_SMB_FS=y
CONFIG_SMB_WIN95=y
# CONFIG_ISO9660_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set

# Character devices
CONFIG_SERIAL=y
# CONFIG_DIGI is not set
# CONFIG_CYCLADES is not set
# CONFIG_STALDRV is not set
# CONFIG_RISCOM8 is not set
# CONFIG_PRINTER is not set
# CONFIG_MOUSE is not set
# CONFIG_UMISC is not set
# CONFIG_QIC02_TAPE is not set
# CONFIG_FTAPE is not set
# CONFIG_APM is not set
# CONFIG_WATCHDOG is not set
# CONFIG_RTC is not set

# Sound
CONFIG_SOUND=y
# CONFIG_PAS is not set
CONFIG_SB=y
CONFIG_ADLIB=y
# CONFIG_GUS is not set
# CONFIG_MPU401 is not set
# CONFIG_UART6850 is not set
# CONFIG_PSS is not set
# CONFIG_GUS16 is not set
# CONFIG_GUSMAX is not set
# CONFIG_MSS is not set
# CONFIG_SSCAPE is not set
# CONFIG_TRIX is not set
# CONFIG_MAD16 is not set
# CONFIG_CS4232 is not set
# CONFIG_MAUI is not set
CONFIG_AUDIO=y
CONFIG_MIDI=y
CONFIG_YM3812=y
SBC_BASE=220
SBC_IRQ=10
SBC_DMA=3
SB_DMA2=7
SB_MPU_BASE=0
SB_MPU_IRQ=-1
DSP_BUFFSIZE=65536
# CONFIG_LOWLEVEL_SOUND is not set

# Kernel hacking
# CONFIG_PROFILE is not set

I hope that this problem will be solved quickly, because currently I
really wonder what I could answer if someone asked me which kernel
version I consider to be stable.
(I'd hate to say something like "1.3.32 was stable for me")

Michel "Walken" LESPINASSE - Student at Ecole Centrale Paris (France)
www Email : walken@via.ecp.fr
(o o) VideoLan project : http://videolan.via.ecp.fr/
------oOO--(_)--OOo-------------------------------------------------------
Yow ! 1135 KB/s remote host TCP bandwidth over 10Mb/s ethernet. Beat that!