More NFS questions

Brad Loomis (bloomis@nda.com)
Fri, 21 Nov 1997 13:24:27 -0500


Thanks to all who helped me out earlier in the week. I really appreciate
it. However, I have some more questions I would like input on.

<----------------------------------------------------------------------------->
Again, my background:

29 Linux compute servers (in a batch queue) running Red Hat 4.1 with kernel
2.0.32 on SCSI drives, 256 MB of RAM, and 1 GB of swap. All identical, with
3Com Boomerang NICs (100 Mb/s). We are also using amd; here is the entry in
/etc/amd.conf for the NFS server options that we push out to the compute
clients:

nfs-server/*
opts:=rw,hard,intr,noquota,grpid,timeo=40,retrans=0,rsize=8192,wsize=8192

2 Linux NFS servers running Red Hat 4.1 with kernel 2.0.31 (I have not had
time to upgrade the servers to 2.0.32), again with 3Com Boomerang NICs in
full duplex. These are data stores that the compute servers mount to read
in data from and push the corresponding results back out to.

1 Cabletron MMAC Plus with Fast Ethernet cards connecting this all together
in one big flat net.
<----------------------------------------------------------------------------->

This is a fairly intensive NFS setup. Multiple compute clients (5-10)
request several-hundred-megabyte files to read in and process, then write
600 MB to 1 GB of data back out. While this is going on, several other jobs
in the queue are writing small amounts of log data (< 300 KB) about the
status of the computation to an NFS partition. The NFS server version is
"Universal NFS server 2.2beta16".

After the kernel upgrade on the clients, I have seen the following errors
across the compute cluster.

1. Several "nfs_stat_to_errno: bad nfs status return value" messages, with
values 110 and 116. However, the mounts recover and everything is fine.
What could cause this error?
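My own guess (not something the kernel message confirms) is that 110 and 116 are Linux errno values leaking into the NFS status field rather than real NFS status codes, which would explain why nfs_stat_to_errno rejects them. A quick sketch to decode them under that assumption:

```python
# Decode the suspicious status values as Linux errno numbers.
# (Assumption: they are errno values, not NFS status codes; the
# number-to-name mapping below is Linux-specific.)
import errno
import os

for code in (110, 116):
    name = errno.errorcode.get(code, "?")     # symbolic name, e.g. ETIMEDOUT
    print(code, name, os.strerror(code))      # human-readable description
# On Linux, 110 is ETIMEDOUT and 116 is ESTALE.
```

If that guess is right, a timeout (ETIMEDOUT) or a stale file handle (ESTALE) would at least be consistent with mounts that recover on their own.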

2. One "nfs_rpc_verify: RPC call failed: 5", which killed a mount and caused
a job that was 14 hours in to fail. What could be the possible cause of
this? Searching the archives, someone else noted it was an issue with a
full-duplex driver and buffering in 2.0.28.
Has this been fixed in 2.0.32?

3. After a while, nfsd and mountd on each of the NFS servers hang all of
the compute clients. Only killing and restarting nfsd/mountd fixes this. I
can cron this to restart every hour or so, but I am wondering whether it is
a known bug? Some of you have said it was.
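For now the workaround would be cron-driven; a sketch of the crontab entry I have in mind (the daemon names rpc.nfsd/rpc.mountd are from the Universal NFS server package, and the /usr/sbin paths are an assumption from my Red Hat boxes -- adjust for your install):

```
# min hour dom mon dow  command -- restart the user-space NFS daemons hourly
# (sketch only: mountd must come back before nfsd serves new mount requests)
0 * * * *  killall rpc.mountd rpc.nfsd; /usr/sbin/rpc.mountd; /usr/sbin/rpc.nfsd
```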

4. Am I pushing the current nfsd implementation too far? I have seen a
2.2beta29 out there and would migrate to it if applicable. I would just
like some input.

5. For a heavily loaded NFS setup, is it better to stay with the 2.0.32
kernel or take my chances with a 2.1.xx kernel? Again, any input/opinions
are appreciated.

As always thanx for your time,

-brad

-------------------------------------------------------------------
Brad Loomis E-Mail:bloomis@nda.com
Network Engineer Phone:781-937-3338 ext329
Net Daemons Assoc., 800 W. Cummings Park Drive,#2050 Woburn MA 01801
"It is kind of fun to do the impossible --Walt Disney"