Re: knfsd and system crashes

Olaf Kirch (okir@monad.swb.de)
Fri, 14 Nov 1997 11:47:18 +0100


On Thu, Nov 13, 1997 at 12:43:16PM +0100, Andi Kleen wrote:
> AFAIK the unfsd uses a hash over the inode numbers of every pathname
> component. When it can't find the filehandle in its internal caches
> it has to walk the directory tree to find the path again. That was done
> because there is no way to open by inode in userspace (basically the same
> problem that knfsd is facing currently. But in-kernel it can be fixed ,)

That's correct. The unfsd file handle looks like this:

4 bytes hash of (inode, dev)
1 byte n = length of path starting at /
n bytes 8bit hash of (inode, dev) of pathname component
rest padded with 0 bytes
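
To make the layout concrete, here is a minimal Python sketch that packs
such a handle. It is not the unfsd code. The hash functions are stand-ins
(the real ones are not given here), the 32-byte size is that of an NFSv2
file handle, the byte order is arbitrary, and the `root` parameter is a
hypothetical convenience so the sketch can be tried on a small tree. It
also assumes the n bytes cover the directories from / down to the one
containing the object, which is one reading of the lookup description
further down.

```python
import os
import struct
import zlib

FHSIZE = 32  # an NFSv2 file handle is 32 opaque bytes


def hash32(dev, ino):
    # Stand-in 32-bit hash (assumption): CRC32 over the packed (inode, dev).
    return zlib.crc32(struct.pack("<QQ", ino, dev))


def hash8(dev, ino):
    # Stand-in 8-bit hash (assumption): low byte of the 32-bit hash.
    return hash32(dev, ino) & 0xFF


def make_handle(path, root="/"):
    """Pack: 4-byte hash, 1-byte n, n component hashes, zero padding."""
    rel = os.path.relpath(path, root)
    names = [] if rel == "." else rel.split(os.sep)
    dirs = names[:-1]  # directories leading to the one holding the object
    if len(dirs) > FHSIZE - 5:
        raise ValueError("path too deep to fit in one file handle")
    hashes = bytearray()
    cur = root
    for name in dirs:
        cur = os.path.join(cur, name)
        st = os.lstat(cur)
        hashes.append(hash8(st.st_dev, st.st_ino))
    st = os.lstat(path)
    head = struct.pack("<IB", hash32(st.st_dev, st.st_ino), len(dirs))
    return (head + bytes(hashes)).ljust(FHSIZE, b"\0")
```

Note how every byte after the first five depends on where the object
sits in the tree, so the handle changes as soon as the object is moved
to a different directory.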

While this file handle is invariant across system reboots, it has other
problems. Probably the worst is that it is not invariant against renames.
Try this on an NFS-mounted partition:

mkdir zappa
(mv zappa/frank .; echo "You lose") > zappa/frank

and you should get 'stale file handle', or, on 2.0.30, an empty file.
(I guess there's a problem with error propagation in the flush-on-close
stuff in NFS). This problem also affects the dentry-stuff file handle
layout, BTW.

The second problem is of course performance. When presented with a file
handle that's not in its internal cache, unfsd has to reconstruct the
file path from the handle. The algorithm goes something like this:

Start with directory /. Do a readdir and check the 8bit hash of each entry
against the corresponding byte in the file handle. If it matches, descend
into that directory. If the directory is exhausted, back up one level. Once
we have found a directory that matches the entire hash path in the file
handle, look for an entry whose 32bit hash matches the one given in the
file handle.
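
The search above can be sketched as a depth-first walk with backtracking.
This is an illustration in Python, not the unfsd implementation: the two
hash functions are stand-ins, the little-endian handle layout (4-byte
hash, 1-byte n, n hash bytes) is assumed, and `lookup` and its `root`
parameter are hypothetical names. POSIX stat semantics are assumed.

```python
import os
import struct
import zlib


def hash32(dev, ino):
    # Stand-in 32-bit hash (assumption): CRC32 over the packed (inode, dev).
    return zlib.crc32(struct.pack("<QQ", ino, dev))


def hash8(dev, ino):
    # Stand-in 8-bit hash (assumption): low byte of the 32-bit hash.
    return hash32(dev, ino) & 0xFF


def lookup(fh, root="/"):
    """Return a path matching handle fh, or None if the search fails."""
    want32, n = struct.unpack_from("<IB", fh)
    want_path = fh[5:5 + n]

    def walk(directory, depth):
        try:
            with os.scandir(directory) as it:
                entries = list(it)
        except OSError:
            return None  # unreadable directory: treat it as exhausted
        if depth == n:
            # This directory matches the whole hash path; now look for
            # the entry whose 32-bit hash matches.
            for e in entries:
                st = e.stat(follow_symlinks=False)
                if hash32(st.st_dev, st.st_ino) == want32:
                    return e.path
            return None
        for e in entries:
            if not e.is_dir(follow_symlinks=False):
                continue
            st = e.stat(follow_symlinks=False)
            if hash8(st.st_dev, st.st_ino) == want_path[depth]:
                found = walk(e.path, depth + 1)
                if found is not None:
                    return found
                # 8-bit collision or dead end: back up and keep scanning
        return None

    return walk(root, 0)
```

With only 8 bits per component, collisions are common, so on a large
tree the walk can read and stat many directories before it finds the
right one or gives up.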

I think it doesn't need explaining why we wouldn't want such a thing in
the kernel nfsd.

Another problem is that unfsd uses a file handle cache at all. I don't
recall what the exact size is, but regardless of how large you make it,
for any real work it will always be too small and start thrashing. Combine
that with the above algorithm and you quickly start wondering why people
don't complain about it any louder than they currently do.

Basically, the _only_ way to do NFS correctly is to put the dev and inode
numbers into the file handle and use them to retrieve the
inode/dentry/whatever.
This is the way Sun's implementation works, and this is what the NFS spec
reflects. Read it, and you'll find that there's really no other way to
do it on a Unix box if you want to be 100% compliant.
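
As a rough illustration of why a dev/inode handle avoids the rename
problem, here is a small Python sketch. The field widths, byte order and
function names are assumptions made for the example, not Sun's or
knfsd's actual layout. It only builds and parses the handle; mapping
(dev, inode) back to an open file is exactly the step that needs kernel
support.

```python
import os
import struct

FHSIZE = 32  # an NFSv2 file handle is 32 opaque bytes


def make_handle(path):
    # Assumed layout: 8 bytes dev, 8 bytes inode, zero padding.
    st = os.lstat(path)
    return struct.pack("<QQ", st.st_dev, st.st_ino).ljust(FHSIZE, b"\0")


def parse_handle(fh):
    # Recover the (dev, inode) pair the server would use for the lookup.
    dev, ino = struct.unpack_from("<QQ", fh)
    return dev, ino
```

Because nothing in this handle depends on the name or the directory, a
rename within the same filesystem leaves it unchanged.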

Olaf

-- 
Olaf Kirch         |  --- o --- Nous sommes du soleil we love when we play
okir@monad.swb.de  |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax
okir@caldera.de    +-------------------- Why Not?! -----------------------