Re: 2.6.15.1: persistent nasty hang in sync_page killing NFS(ne2k-pci / DP83815-related?), i686/PIII

From: Trond Myklebust
Date: Mon Jan 30 2006 - 12:08:00 EST


On Mon, 2006-01-30 at 16:55 +0000, Nix wrote:
> On Sun, 29 Jan 2006, Trond Myklebust stipulated:
> > As a general rule of thumb: if tcpdump/ethereal can see the reply on the
> > client, then the engine socket should see it too. If tcpdump is indeed
> > seeing those replies, you should check the RPC code by
> > setting /proc/sys/sunrpc/rpc_debug to 1.
>
> tcpdump is seeing them.

The complete packets, including all fragments?


> ... have a pile of messages in the midst of a locked-up transfer:
>
> Jan 30 16:50:57 loki warning: kernel: -pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- --exit--
> Jan 30 16:50:57 loki warning: kernel: 15046 0006 0021 -00011 c1a11600 100003 c801f000 00000000 xprt_resend c02c3798 c01c0e4d
> Jan 30 16:50:57 loki warning: kernel: 15047 0006 0021 000000 c1a11600 100003 c801f0b8 00000070 xprt_pending c02c3869 c01c0e4d
> Jan 30 16:50:57 loki warning: kernel: 15048 0006 0021 -00011 c1a11600 100003 c801f170 00000000 xprt_resend c02c3798 c01c0e4d
> Jan 30 16:50:57 loki warning: kernel: 15049 0006 0001 -00011 c1a11600 100003 c801f228 00000000 xprt_sending c02c3798 c01c0e4d
> Jan 30 16:50:57 loki warning: kernel: RPC: 15047 xprt_timer
> Jan 30 16:50:57 loki warning: kernel: RPC: cong 256, cwnd was 256, now 256
> Jan 30 16:50:57 loki warning: kernel: RPC: 15048 xprt_cwnd_limited cong = 0 cwnd = 256
> Jan 30 16:50:57 loki warning: kernel: RPC: 15048 xprt_prepare_transmit
> Jan 30 16:50:57 loki warning: kernel: RPC: 15048 xprt_transmit(116)
> Jan 30 16:50:57 loki warning: kernel: RPC: 15048 xmit complete
> Jan 30 16:50:57 loki warning: kernel: RPC: 15047 xprt_prepare_transmit
> Jan 30 16:50:57 loki warning: kernel: RPC: 15047 xprt_cwnd_limited cong = 256 cwnd = 256
> Jan 30 16:50:57 loki warning: kernel: RPC: 15047 failed to lock transport c1a11800
> Jan 30 16:50:58 loki warning: kernel: RPC: 15048 xprt_timer
> Jan 30 16:50:58 loki warning: kernel: RPC: cong 256, cwnd was 256, now 256
> Jan 30 16:50:58 loki warning: kernel: RPC: 15046 xprt_cwnd_limited cong = 0 cwnd = 256
> Jan 30 16:50:58 loki warning: kernel: RPC: 15046 xprt_prepare_transmit
> Jan 30 16:50:58 loki warning: kernel: RPC: 15046 xprt_transmit(116)
> Jan 30 16:50:58 loki warning: kernel: RPC: 15046 xmit complete
> Jan 30 16:50:58 loki warning: kernel: RPC: 15048 xprt_prepare_transmit
> Jan 30 16:50:58 loki warning: kernel: RPC: 15048 xprt_cwnd_limited cong = 256 cwnd = 256
> Jan 30 16:50:58 loki warning: kernel: RPC: 15048 failed to lock transport c1a11800
> Jan 30 16:50:59 loki warning: kernel: RPC: 15046 xprt_timer
> Jan 30 16:50:59 loki warning: kernel: RPC: cong 256, cwnd was 256, now 256
> Jan 30 16:50:59 loki warning: kernel: RPC: 15047 xprt_cwnd_limited cong = 0 cwnd = 256
> Jan 30 16:50:59 loki warning: kernel: RPC: 15047 xprt_prepare_transmit
> Jan 30 16:50:59 loki warning: kernel: RPC: 15047 xprt_transmit(116)
> Jan 30 16:50:59 loki warning: kernel: RPC: 15047 xmit complete
> Jan 30 16:50:59 loki warning: kernel: RPC: 15046 xprt_prepare_transmit
> Jan 30 16:50:59 loki warning: kernel: RPC: 15046 xprt_cwnd_limited cong = 256 cwnd = 256
> Jan 30 16:50:59 loki warning: kernel: RPC: 15046 failed to lock transport c1a11800
> Jan 30 16:51:00 loki warning: kernel: RPC: 15047 xprt_timer
> [repeats indefinitely]
> Jan 30 16:51:38 loki warning: kernel: -pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- --exit--
> Jan 30 16:51:38 loki warning: kernel: 15046 0006 0021 -00011 c1a11600 100003 c801f000 00000000 xprt_resend c02c3798 c01c0e4d
> Jan 30 16:51:38 loki warning: kernel: 15047 0006 0021 000000 c1a11600 100003 c801f0b8 00000140 xprt_pending c02c3869 c01c0e4d
> Jan 30 16:51:38 loki warning: kernel: 15048 0006 0021 -00011 c1a11600 100003 c801f170 00000000 xprt_resend c02c3798 c01c0e4d
>
> The RPC messages are emitted at pretty much exactly the same frequency
> as the ACKs.
>
> I *guess* that the `failed to lock transport' is the underlying error...
> time to add some debugging and find out what task is locking the
> transport. Back soon, must rebuild the kernel and reboot to clear this
> lock ;)

Nope. The congestion window is 1 request, and you do indeed appear to
have one request on the "pending" queue. The problem in the above trace
is a complete lack of data_ready messages, meaning that the socket is
never seeing any complete packets come in.

Cheers,
Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/