Re: 2.6.24: RPC: bad TCP reclen 0x00020090 (large)

From: Tom Tucker
Date: Mon Feb 18 2008 - 16:51:42 EST



On Mon, 2008-02-18 at 15:25 -0600, Tom Tucker wrote:
> On Mon, 2008-02-18 at 04:58 -0800, Andrew Morton wrote:
> > (suitable cc added)
> >
> > (regression)
> >
> > On Wed, 13 Feb 2008 17:02:53 +0300 Michael Tokarev <mjt@xxxxxxxxxx> wrote:
> >
> > > Hello!
> > >
> > > After upgrading to 2.6.24 (from .23), we're seeing ALOT
> > > of messages like in $subj in dmesg:
> > >
> > > Feb 13 13:21:39 paltus kernel: RPC: bad TCP reclen 0x00020090 (large)
> > > Feb 13 13:21:46 paltus kernel: printk: 3586 messages suppressed.
> > > Feb 13 13:21:46 paltus kernel: RPC: bad TCP reclen 0x00020090 (large)
> > > Feb 13 13:21:49 paltus kernel: printk: 371 messages suppressed.
> > > Feb 13 13:21:49 paltus kernel: RPC: bad TCP reclen 0x00020090 (large)
> > > Feb 13 13:21:55 paltus kernel: printk: 2979 messages suppressed.
> > > ...
> > >
> > > with linux NFS server. The clients are all linux too, mostly 2.6.23
> > > and some 2.6.22.
> > >
> > > I found the "offending" piece of code in net/sunrpc/svcsock.c,
> > > in routine svc_tcp_recvfrom() with condition being:
> > >
> > > if (svsk->sk_reclen > serv->sv_max_mesg) ...
>
> The problem might be that the client is setting a bit in the RPC message
> length field that is meant to be interpreted and masked off by the
> server -- and we're not doing it yet. My bet is that 0x20000 is the bit
> we're looking for. I'll poke around...

Never mind. The way this is supposed to work is that the transport is
shut down. The code used to delete the socket directly, but I
reorganized this code to just set the XPT_CLOSE bit and let the normal
close path handle it when it came back through to retry. I'm not sure
exactly what version of the code you have, but it may be that your
missing the code from the close path that does this, or it may just not
work. As a first shot, can you try this patch and tell me if the
messages go away?


diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 1d3e5fc..cf6150a 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -920,6 +920,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)

err_delete:
set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
+ svc_delete_xprt(&svsk->sk_xprt);
return -EAGAIN;

error:


>
> > >
> > > This happens after a server reboot. At this point, client(s) are trying
> > > to perform some NFS transaction and fail, and server starts generating
> > > the above messages - till I do a umount followed by mount on all clients.
> > > Before, such situation (nfs server reboot) were handled transparently,
> > > ie, there was nothing to do, the mount continued working just fine when
> > > the server comes back online.
> > >
> > > Now, I'm not sure if it's really 2.6.24-specific problem or a userspace
> > > problem. Some time ago we also upgraded nfs-kernel-server (Debian)
> > > package, and the remount-after-nfs-server-reboot problem started to
> > > occur at THAT time (and it is something to worry about as well, I just
> > > had no time to deal with it); but the dmesg spamming only appeared
> > > with 2.6.24.
> > >
> > > How to debug the issue further on from this point?
> > >
> >
> >
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/