Re: [2.6.26-rc4] mount.nfsv4/memory poisoning issues...

From: Chuck Lever
Date: Mon Jun 16 2008 - 12:18:40 EST

Next message: Sean Young: "Re: Regression: boot failure on AMD Elan TS-5500"
Previous message: Miquel van Smoorenburg: "Re: XFS internal error xfs_trans_cancel at line 1163 of file fs/xfs/xfs_trans.c"
Next in thread: Jeff Layton: "Re: [2.6.26-rc4] mount.nfsv4/memory poisoning issues..."

Hi Daniel-

On Jun 15, 2008, at 2:10 PM, Daniel J Blueman wrote:

On Thu, Jun 5, 2008 at 12:43 AM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
Hi Daniel-

On Wed, Jun 4, 2008 at 7:33 PM, Daniel J Blueman
<daniel.blueman@xxxxxxxxx> wrote:

Having experienced 'mount.nfs4: internal error' when mounting nfsv4 in
the past, I have a minimal test-case I sometimes run:

$ while :; do mount -t nfs4 filer:/store /store; umount /store; done

After ~100 iterations, I saw the 'mount.nfs4: internal error',
followed by symptoms of memory corruption [1], a locking issue with
the reporting [2] and another (related?) memory-corruption issue
(off-by-1?) [3]. A little analysis shows memory being overwritten by
(likely) a poison value, which gets complicated if it's not
use-after-free...

Anyone dare confirm this issue? NFSv4 server is x86-64 Ubuntu 8.04
2.6.24-18, client U8.04 2.6.26-rc4; batteries included [4].

We have some other reports of late model kernels with memory
corruption issues during NFS mount. The problem is that by the time
these canaries start singing, the evidence of what did the corrupting
is long gone.

I'm happy to decode addresses, test patches etc.

If these crashes are more or less reliably reproduced, it would be
helpful if you could do a 'git bisect' on the client to figure out at
what point in the kernel revision history this problem was introduced.

Have you seen the problem on client kernels earlier than 2.6.25?

Firstly, I had omitted that I'd booted the kernel with
debug_objects=1, which provides the canary here.

The primary failure I see is 'mount.nfs4: internal error', and always
after 358 umount/mount cycles (plus 1 initial mount) which gives us a
clue; 'netstat' shows all these connections in a TIME_WAIT state, thus
the bug is in the error path taken when a socket cannot be allocated. I
found that after the connection lifetime expired, you can mount again,
which corroborates this theory.
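
The TIME_WAIT observation above can also be checked without 'netstat'.
A minimal sketch, assuming a Linux client that lists its IPv4 TCP
sockets in /proc/net/tcp, where the fourth column is the socket state
in hex and 06 means TIME_WAIT. The function name count_time_wait is
made up for illustration:

```shell
# Count sockets in TIME_WAIT (state 06) in a /proc/net/tcp-style table.
# The first line of the table is a column header, so it is skipped.
# An optional argument names a saved copy to read instead of the live file.
count_time_wait() {
    awk 'NR > 1 && $4 == "06" { n++ } END { print n + 0 }' "${1:-/proc/net/tcp}"
}
```

If the theory holds, calling count_time_wait between iterations of the
mount/umount loop should show the count climbing until the mount fails.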

In this case, the mount() syscall resulted in the mount.nfs4 process
being SEGV'd when the kernel was booted with 'debug_objects=1'. Without
this option, we see:

# strace /sbin/mount.nfs4 x1:/ /store
...
mount("x1:/", "/store", "nfs4", 0,
"addr=192.168.0.250,clientaddr=19"...) = -1 EIO (Input/output error)

So, it's impossible to tell when the corruption was introduced, as it
has only become detectable recently.

The socket-allocation error path is worth a look-over, if someone can
check. The problem reproduces 100% of the time with the
'debug_objects=1' param (available since 2.6.26-rc1) and 359 mounts in
quick succession.
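
The figure of 359 may not be arbitrary. This is a sketch under two
assumptions that the thread does not confirm: that the client's sunrpc
reserved-port range is at its usual defaults of 665 to 1023, and that
each mount binds one reserved source port which then lingers in
TIME_WAIT. The helper name resvport_count is hypothetical:

```shell
# Assumed defaults for the sunrpc reserved source-port range.
min_resvport=665
max_resvport=1023

# Number of ports in an inclusive range: max - min + 1.
resvport_count() {
    echo $(( $2 - $1 + 1 ))
}

resvport_count "$min_resvport" "$max_resvport"   # prints 359
```

On a live client, the actual limits can be read from
/proc/sys/sunrpc/min_resvport and /proc/sys/sunrpc/max_resvport rather
than assumed.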

That's nicely specific.

I'm juggling several other problems at the moment, but I will try to get back to this again in a day or two.

There's a lot of new code in the NFS mount path. "internal error" means the mount command has encountered something entirely unexpected. So when anyone sees this message, please go ahead and report it on the linux-nfs@xxxxxxxxxxxxxxx list. Thanks!

I will see if I can make that message a little more explicit.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
