Re: intermittent NFS hangs from NetApp

From: Garrick Staples (gstaples@interliant.com)
Date: Tue Feb 29 2000 - 14:39:53 EST


I was able to capture a tcpdump during the NFS hang today. Rather than
email ~6000 lines of logfiles to the list, you can grab it at
http://www.speculation.org/tcpdump_during_nfs_hang. The hang started at
14:00:21 and ended at 14:08:23 when I fixed it with the magical
incantation of umount -f. There are several minutes of logs before and
after the offending time.
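
For anyone who would rather not read all ~6000 lines, here is a minimal
sketch of how the dump could be cut down to the hang window. It assumes
each packet line starts with tcpdump's default hh:mm:ss.fraction
timestamp and that -vv continuation lines carry no timestamp; the
function name lines_in_window is just illustrative.

```python
import re
from datetime import time

# Hang window taken from the times quoted above.
HANG_START = time(14, 0, 21)
HANG_END = time(14, 8, 23)

# Assumption: packet lines begin with hh:mm:ss[.fraction], tcpdump's
# default text timestamp. Lines without one are treated as continuations.
STAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})(?:\.\d+)?\s")


def lines_in_window(lines, start=HANG_START, end=HANG_END):
    """Yield lines whose packet timestamp falls inside [start, end].

    Comparison is at one-second granularity; the fraction is ignored.
    Continuation lines follow the decision made for their packet line.
    """
    keep = False
    for line in lines:
        m = STAMP.match(line)
        if m:
            h, mi, s = (int(g) for g in m.groups())
            keep = start <= time(h, mi, s) <= end
        if keep:
            yield line
```

Feeding it an open file object for the saved dump and writing the
result to stdout should leave only the packets seen during the hang.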

When looking at the log, bear in mind that there are 4 other NFS mounts
from the same NetApp that don't lock up. During the hang I pinged the
NetApp and read and wrote files on the other mounts. The hanging mount
is mounted with "nolock", but the other 4 have just "defaults". The tcpdump
command was ``tcpdump -vv -i eth2 -T rpc host 192.168.0.253''.

/var/log/messages shows the same stuff I mentioned earlier:
Feb 29 14:00:33 dhp0020 kernel: nfs: server 192.168.0.253 not
responding, still trying
Feb 29 14:00:33 dhp0020 kernel: nfs: server 192.168.0.253 not
responding, still trying
Feb 29 14:00:35 dhp0020 kernel: nfs: task 39042 can't get a request slot
Feb 29 14:00:35 dhp0020 kernel: nfs: task 39048 can't get a request slot
Feb 29 14:01:15 dhp0020 kernel: nfs: task 39049 can't get a request slot
Feb 29 14:01:23 dhp0020 kernel: nfs: task 39050 can't get a request slot
Feb 29 14:06:20 dhp0020 kernel: nfs: task 39059 can't get a request slot
Feb 29 14:08:21 dhp0020 kernel: nfs_statfs: statfs error = 5
Feb 29 14:08:21 dhp0020 kernel: nfs_statfs: statfs error = 5
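
To line these messages up against the capture, something like the
following rough sketch could tally them. It assumes the classic syslog
layout "Mon DD HH:MM:SS host kernel: message" and treats any line
without that prefix as the wrapped tail of the previous entry (as in
the "not / responding" lines above). The names parse_entries and
summarize are hypothetical.

```python
import re
from collections import Counter

# Assumption: classic syslog layout "Mon DD HH:MM:SS host kernel: message".
ENTRY = re.compile(
    r"^[A-Z][a-z]{2}\s+\d{1,2}\s+(\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+(.*)$"
)


def parse_entries(lines):
    """Return (time, message) pairs, rejoining lines the mailer wrapped."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = ENTRY.match(line)
        if m:
            entries.append([m.group(1), m.group(2)])
        elif entries:
            # No syslog prefix: append to the previous entry's message.
            entries[-1][1] += " " + line
    return [tuple(e) for e in entries]


def summarize(entries):
    """Count messages by kind, collapsing the per-task numbers."""
    counts = Counter()
    for _, msg in entries:
        counts[re.sub(r"task \d+", "task N", msg)] += 1
    return counts
```

Run over the excerpt above, this groups the lines into three kinds of
message: the "not responding" warning, the "can't get a request slot"
errors, and the statfs failures.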

Again, kernel is 2.2.12-32smp #1 SMP, stock RedHat 6.1 as oemed by Dell.

Alan Cox wrote:
>
> > When the problem occurs, all processes that touch the mount are
> > indefinitely hung. Errors in messages show up:
> > Feb 24 18:08:04 dhp0020 kernel: nfs: server 192.168.0.253 not
> > responding, still trying
> > Feb 24 18:08:04 dhp0020 kernel: nfs: server 192.168.0.253 not
> > responding, still trying
>
> As far as it's concerned, the NetApp isn't talking.
>
> > The problem is immediately solved with a simple umount -f; which fails
> > because of the current processes, but it fixes the hang! When I do
> > umount -f, all of the waiting processes get a failed read, but they
> > continue normally.
>
> That suggests a wakeup got missed somewhere. It doesn't fit the NetApp
> not talking. Either the NetApp is losing a consistent request or it is
> the NFS client in the kernel. Both are possible; only doing some network
> dumps (tcpdump with -s 1514 and asked to decode NFS frames) when it
> starts hanging would tell.

