Re: 2.4.20 NFS server lock-up (SMP)

From: Daniel Forrest (forrest@lmcg.wisc.edu)
Date: Thu Jan 30 2003 - 02:00:27 EST


Yenya,

>> I have a problem on Linux 2.4.20 with NFS server - my NFS server
>> from time to time (currently about once a day) stops responding to
>> NFS requests. Apart from that the system is OK, I can log in via
>> SSH, and I can run "/sbin/reboot -n -f" to reboot it. Also
>> filesystem operations seem to be OK, even Samba server is
>> responding. When this happens, I see all nfsd processes and lockd
>> process to be stuck in the "D" state.

My guess is that this is related to a bug in garbage collection in
lockd. The trigger point is when you pass 32 unique clients. If you
are able to recompile the kernel, try the patch below.

A deadlock occurs under the following sequence:

->nlmsvc_lock calls down(&file->f_sema)
 ->nlmsvc_create_block
  ->nlmclnt_lookup_host
   ->nlm_lookup_host may do garbage collection
    ->nlm_gc_hosts
     ->nlmsvc_mark_resources
      ->nlm_traverse_files action = NLM_ACT_MARK
       ->nlm_inspect_file loops over all files
        ->nlmsvc_traverse_blocks calls down(&file->f_sema)
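
As a rough user-space illustration of the sequence above (not kernel
code), the sketch below models file->f_sema with a POSIX semaphore.
The names fake_file, traverse_blocks, lock_path and run_scenario are
hypothetical stand-ins, and sem_trywait() replaces the second down()
so that the sketch reports the deadlock instead of hanging. It assumes
a Linux system with POSIX unnamed semaphores.

```c
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lockd file structure; only the
 * per-file semaphore matters for this sketch. */
struct fake_file {
    sem_t f_sema;   /* binary semaphore, modelling file->f_sema */
};

/* Models nlmsvc_traverse_blocks(), which needs f_sema. The kernel
 * calls down() and would sleep forever if the caller already holds
 * the semaphore; sem_trywait() lets us detect that case instead.
 * Returns true if the semaphore could be taken. */
static bool traverse_blocks(struct fake_file *file)
{
    if (sem_trywait(&file->f_sema) != 0) {
        assert(errno == EAGAIN);  /* semaphore already held */
        return false;             /* a real down() would block here */
    }
    sem_post(&file->f_sema);
    return true;
}

/* Models nlmsvc_lock() reaching garbage collection through the host
 * lookup. With drop_sema_for_lookup set, the semaphore is released
 * around the lookup and re-taken afterwards, as in the patch. */
static bool lock_path(struct fake_file *file, bool drop_sema_for_lookup)
{
    bool gc_ok;

    sem_wait(&file->f_sema);          /* down(&file->f_sema) */

    if (drop_sema_for_lookup)
        sem_post(&file->f_sema);      /* up() before the host lookup */

    gc_ok = traverse_blocks(file);    /* host lookup triggers GC */

    if (drop_sema_for_lookup)
        sem_wait(&file->f_sema);      /* down() again after the lookup */

    sem_post(&file->f_sema);          /* release on the way out */
    return gc_ok;
}

/* Runs one scenario on a fresh semaphore. Returns 1 if the simulated
 * garbage collection could proceed, 0 if it would have deadlocked. */
int run_scenario(int apply_patch)
{
    struct fake_file file;
    bool ok;

    sem_init(&file.f_sema, 0, 1);
    ok = lock_path(&file, apply_patch != 0);
    sem_destroy(&file.f_sema);
    return ok ? 1 : 0;
}
```

Calling run_scenario(0) models the unpatched path, where the nested
acquisition cannot succeed; run_scenario(1) models the path with the
semaphore dropped around the host lookup.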

This is a patch against 2.5.53, but it should also apply to any 2.4 or
2.5 tree since the code is virtually identical.

--- fs/lockd/svclock.c.ORIG Mon Dec 23 23:19:52 2002
+++ fs/lockd/svclock.c Mon Dec 30 13:42:10 2002
@@ -176,8 +176,14 @@
         struct nlm_rqst *call;
 
         /* Create host handle for callback */
+        /* We must up the semaphore in case the host lookup does
+         * garbage collection (which calls nlmsvc_traverse_blocks),
+         * but this shouldn't be a problem because nlmsvc_lock has
+         * to retry the lock after this anyway */
+        up(&file->f_sema);
         host = nlmclnt_lookup_host(&rqstp->rq_addr,
                                 rqstp->rq_prot, rqstp->rq_vers);
+        down(&file->f_sema);
         if (host == NULL)
                 return NULL;
 
I have tried repeatedly to get this patch into the kernel, but it
hasn't made it yet. If this does solve your problem, let me know and
I will try one more time to get it accepted.

-- 
Dan