Re: Deadlock in NFSv4 in all kernels

From: Trond Myklebust
Date: Tue May 25 2010 - 10:04:43 EST


On Tue, 2010-05-25 at 09:45 -0400, William A. (Andy) Adamson wrote:
> 2010/5/7 Lukas Hejtmanek <xhejtman@xxxxxxxxxxx>:
> > Hi,
> >
> > I encountered the following problem. We use short expiration time for
> > kerberos contexts created by rpc.gssd (some patches were included in mainline
> > nfs-utils). In particular, we use 120secs expiration time.
> >
> > Now, I run application that eats 80% of available RAM. Then I run 10 parallel
> > dd processes that write data into NFS4 volume with sec=krb5.
> >
> > As soon as the kerberos context expires (i.e., up to 120 secs), the whole
> > system gets stuck in do_page_fault and succesive functions. It is because
> > there is no free memory in kernel, all free memory is used as cache for NFS4
> > (due to dd traffic), kernel ask NFS to write back its pages but NFS cannot do
> > anything as it is missing valid context. NFS contacts rpc.gssd to provide
> > a renewed context, the rpc.gssd does not provide the context as it needs some memory
> > to scan /tmp for a ticket. I.e., it deadlocks.
> >
> > Longer context expiration time is no real solution as it only makes the
> > deadlock less often.
> >
> > Any ideas what can be done here?
>
> Not get into the problem in the first place: this means
>
> 1) determine a 'lead time' where the NFS client declares a context
> expired even though it really as 'lead time' until it actually
> expires.
>
> 2) flush all writes on any contex that will expire within the lead
> time which needs to be long enough for flushes to take place.

That too is only a partial solution. The GSS context can expire early
due to totally unforeseeable circumstances such as a server reboot, for
instance.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/