Re: A multi-threaded NFS server for Linux

Olaf Kirch (okir@monad.swb.de)
Tue, 26 Nov 1996 23:09:08 +0100


Hi all,

here are some ramblings about implementing nfsd, the differences
between kernel- and user-space, and life in general. It's become quite
long, so if you're not interested in any of these topics,
just skip it...

On Sun, 24 Nov 1996 12:01:01 PST, "H.J. Lu" wrote:
> With the upcoming Linux C library 6.0, it is possible to
> implement a multi-threaded NFS server in the user space using
> the kernel-based pthread and MT-safe API included in libc 6.0.

In my opinion, servicing NFS from user space is an idea that should die.
The current unfsd (and I'm pretty sure this will hold for any other
implementation) has a host of problems:

1. Speed.

This is only partly related to nfsd being single-threaded. A while ago
I ran some benchmarks comparing my kernel-based nfsd to
the user-space nfsd.

In the unfsd case, I was running 4 daemons in parallel (which is possible
even now as long as you restrict yourself to read-only access), and
found that the upper limit for peak throughput was around 800 KBps; the rate
for sustained reads was even lower. In comparison, the kernel-based
nfsd achieved around 1.1 MBps peak throughput, which is almost
the theoretical cheapernet limit; its sustained rate was around 1 MBps.
Testers of my recent knfsd implementation reported a sustained rate
of 3.8 MBps over 100 Mbps Ethernet.

Even though some tweaking of the unfsd source (especially getting rid
of the Sun RPC code) may improve performance some more, I don't believe
the user-space implementation can be pushed much further. [Speaking of the
RPC library, a rewrite would be required anyway to safely support NFS over
TCP. You can easily hang a vanilla RPC server by sending an incomplete
request over TCP and keeping the connection open.]
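To make that last point concrete, here is a toy client (not taken from any
real code; server address and port come from the command line). It announces
a 400-byte record in the TCP record-marking header, delivers only a fraction
of it, and then sleeps. A single-threaded server built on the stock RPC code
sits in the record read waiting for the rest, and serves nobody else while
it waits:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(int argc, char **argv)
    {
        struct sockaddr_in sin;
        unsigned char partial[16];
        uint32_t marker;
        int fd;

        if (argc != 3) {
            fprintf(stderr, "usage: %s server-ip port\n", argv[0]);
            return 1;
        }

        memset(&sin, 0, sizeof(sin));
        sin.sin_family      = AF_INET;
        sin.sin_addr.s_addr = inet_addr(argv[1]);
        sin.sin_port        = htons(atoi(argv[2]));

        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0 || connect(fd, (struct sockaddr *) &sin, sizeof(sin)) < 0) {
            perror("connect");
            return 1;
        }

        /* Record marking header: "last fragment" bit plus a length of
         * 400 bytes, of which we only ever deliver 16. */
        marker = htonl(0x80000000UL | 400);
        memset(partial, 0, sizeof(partial));
        memcpy(partial, &marker, sizeof(marker));
        write(fd, partial, sizeof(partial));

        pause();    /* keep the connection open; the server keeps waiting */
        return 0;
    }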

Now add to that the synchronization overhead required to keep the file
handle cache in sync between the various threads...

This leads me straight to the next topic:

2. File Handle Layout

Traditional nfsds usually stuff a file's device and inode number into the
file handle, along with some information on the exported inode. Since
a user-space program has no way of opening a file given just its inode
number, unfsd takes a different approach. It basically creates a hashed
version of the file's path: each path component is stat'ed, and an 8-bit
hash of the component's device and inode number is used.
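A rough sketch of that construction (the hash function here is made up, and
the part of the handle describing the exported inode is left out):

    #include <sys/types.h>
    #include <sys/stat.h>

    /* Made-up one-byte hash of a component's device and inode number. */
    static unsigned char hash_dev_ino(dev_t dev, ino_t ino)
    {
        return (unsigned char) (dev ^ ino ^ (ino >> 8) ^ (ino >> 16));
    }

    /* Append one hash byte per component of `path' to `fh'.  Returns the
     * number of bytes used, or -1 on error.  Illustrative only. */
    static int build_path_hash(const char *path, unsigned char *fh, int maxlen)
    {
        char prefix[1024];
        struct stat st;
        const char *p = path;
        size_t len = 0;
        int n = 0;

        while (*p != '\0' && n < maxlen) {
            /* extend the prefix by the next component (and its '/') */
            do {
                if (len + 1 >= sizeof(prefix))
                    return -1;
                prefix[len++] = *p++;
            } while (*p != '\0' && *p != '/');
            prefix[len] = '\0';

            /* one stat() per component, one hash byte per component */
            if (stat(prefix, &st) < 0)
                return -1;
            fh[n++] = hash_dev_ino(st.st_dev, st.st_ino);
        }
        return n;
    }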

The first problem is that this kind of file handle is not invariant
against renames from one directory to another. Agreed, this doesn't
happen too often, but it does break Unix semantics. Try this on an
nfs-mounted file system (with appropriate foo and bar):

(mv bar foo/bar; cat) < bar

The second problem is a lot worse. When unfsd is presented with a file
handle it does not have in its cache, it must map it to a valid path
name. This is basically done in the following way:

    path = "/";
    depth = 0;
    while (depth < length(fhandle)) {
    deeper:
        dirp = opendir(path);
        while ((entry = readdir(dirp)) != NULL) {
            if (hash(dev, ino) matches fhandle component) {
                remember dirp;
                append entry to path;
                depth++;
                goto deeper;
            }
        }
        closedir(dirp);
        backtrack;
    }

Needless to say, this is not very fast. The file handle cache helps
a lot here, but this kind of mapping operation occurs far more often
than one might expect (consider a development tree where files get
created and deleted continuously). In addition, the current implementation
discards conflicting handles when there's a hash collision.

This file handle layout also leaves little room for any additional
baggage. Unfsd currently uses 4 bytes for an inode hash of the file
itself and 28 bytes for the hashed path, but as soon as you add other
information like the inode generation number, you will sooner or
later run out of room.
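In other words (purely illustrative, not the exact unfsd definition):

    /* NFSv2 file handles are 32 opaque bytes.  Roughly the split
     * described above: */
    struct unfsd_fh_layout {
        unsigned char inode_hash[4];    /* hash of the file's own dev/ino */
        unsigned char path_hash[28];    /* one byte per path component    */
    };
    /* Stealing, say, 4 bytes for a generation number leaves 24 path
     * components, and nothing at all for anything else. */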

Last but not least, the file handle cache must be strictly synchronized
between the different nfsd processes/threads. Suppose a rename of foo to
bar is handled by thread 1, and a subsequent read of the file is handled
by thread 2. If the latter doesn't know the cached path is stale,
it will fail. You could of course retry every operation that fails with
ENOENT, but this would add even more clutter and overhead to the code.
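To give an idea of what that synchronization means with the kernel-based
threads mentioned at the top: every lookup and every invalidation has to
funnel through the same lock. The cache structure and functions below are
made up, a minimal sketch only:

    #include <pthread.h>
    #include <string.h>

    /* Made-up cache entry mapping a file handle to a path. */
    struct fh_cache_entry {
        unsigned char fh[32];
        char          path[1024];
        int           valid;
    };

    static pthread_mutex_t fh_cache_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called by the thread that handles a RENAME. */
    static void fh_cache_rename(struct fh_cache_entry *e, const char *newpath)
    {
        pthread_mutex_lock(&fh_cache_lock);
        strncpy(e->path, newpath, sizeof(e->path) - 1);
        e->path[sizeof(e->path) - 1] = '\0';
        pthread_mutex_unlock(&fh_cache_lock);
    }

    /* Called by every other thread before it trusts a cached path. */
    static int fh_cache_lookup(const struct fh_cache_entry *e,
                               char *pathbuf, size_t buflen)
    {
        int ok;

        pthread_mutex_lock(&fh_cache_lock);
        ok = e->valid;
        if (ok) {
            strncpy(pathbuf, e->path, buflen - 1);
            pathbuf[buflen - 1] = '\0';
        }
        pthread_mutex_unlock(&fh_cache_lock);
        return ok;
    }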

3. Adherence to the NFSv2 specification

The Linux nfsd currently does not fulfill the NFSv2 spec in its entirety.
Especially when it comes to safe writes, it is really a fake: it neither
makes an attempt to sync file data before replying to the client (which
could be implemented, along with an `async' export option for turning
off this kind of behavior), nor does it sync meta-data after inode
operations (which is impossible from user space). To most people this
is no big loss, but this behavior is definitely not acceptable if you
want industry-strength NFS.
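The data half of it would look roughly like this in a user-space WRITE
handler (a sketch only; the descriptor is assumed to be already resolved
from the file handle):

    #include <unistd.h>
    #include <sys/types.h>

    /* Write the client's data and force it to stable storage before the
     * reply goes out, as NFSv2 demands.  What cannot be done this way is
     * syncing the metadata touched by CREATE, RENAME and friends. */
    static int nfs_write_stable(int fd, off_t offset, const void *buf, size_t len)
    {
        if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
            return -1;
        if (write(fd, buf, len) != (ssize_t) len)
            return -1;
        return fsync(fd);
    }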

But even if you did implement at least synchronous file writes in unfsd,
be it as an option or as the default, there seems to be no way to
implement some of the more advanced techniques like gathered writes.
When implementing gathered writes, the server tries to detect whether
other nfsd threads are writing to the file at the same time (which
frequently happens when the client's biods flush out the data on file
close), and if they do, it delays syncing file data for a few milliseconds
so the others can finish first, and then flushes all data in one go. You
can do this in kernel-land by watching inode->i_writecount, but you're
totally at a loss in user-space.
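Very roughly, and with everything except i_writecount being a placeholder
name, the kernel-side idea looks like this:

    /* Simplified stand-in for the kernel's struct inode; only the field
     * used below. */
    struct inode { int i_writecount; };

    void nfsd_delay_flush(struct inode *);   /* placeholder: sleep a few ms  */
    void nfsd_flush_file(struct inode *);    /* placeholder: sync dirty data */

    static void nfsd_commit_write(struct inode *inode)
    {
        if (inode->i_writecount > 1) {
            /* other writers (typically the client's remaining biod
             * requests) still have the file open for writing, so give
             * them a few milliseconds to get their data in first */
            nfsd_delay_flush(inode);
        }
        nfsd_flush_file(inode);     /* then push everything out in one go */
    }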

4. Supporting NFSv3

A user-space NFS server is not particularly well suited for implementing
NFSv3. For instance, NFSv3 tries to help cache consistency on the client
by providing pre-operation attributes for some operations, such as
the WRITE call. When a client finds that the pre-operation attributes
returned by the server agree with those it has cached, it can safely
assume that any data it has cached was still valid when the server
replied to its call, so there's no need to discard the cached file data
and meta-data.

However, pre-op attributes can only be provided safely when the server
retains exclusive access to the inode throughout the operation. This is
impossible from user space.
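For comparison, the kernel-side sequence is roughly the following (struct
inode is a simplified stand-in, and the locking and write helpers are
placeholders, not real kernel API):

    /* Simplified stand-ins for illustration only. */
    struct inode    { unsigned long i_size, i_mtime, i_ctime; };
    struct wcc_attr { unsigned long size, mtime, ctime; };

    void nfsd_lock_inode(struct inode *);     /* placeholder: exclusive hold   */
    void nfsd_unlock_inode(struct inode *);
    void nfsd_do_write(struct inode *);       /* placeholder: the write itself */

    static void wcc_snapshot(const struct inode *inode, struct wcc_attr *a)
    {
        a->size  = inode->i_size;
        a->mtime = inode->i_mtime;
        a->ctime = inode->i_ctime;
    }

    /* Hold the inode across the whole sequence, so the pre-op snapshot
     * really describes the state the write acted upon. */
    static void nfsd3_write(struct inode *inode,
                            struct wcc_attr *pre, struct wcc_attr *post)
    {
        nfsd_lock_inode(inode);
        wcc_snapshot(inode, pre);       /* pre-op attributes for the reply  */
        nfsd_do_write(inode);
        wcc_snapshot(inode, post);      /* post-op attributes for the reply */
        nfsd_unlock_inode(inode);
    }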

A similar example is the exclusive create operation where a verifier
is stored in the inode's atime/mtime fields by the server to guarantee
exactly-once behavior even in the face of request retransmissions. These
values cannot be checked atomically by a user-space server.
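Sketched out (again with a simplified inode, and with a made-up mapping of
the verifier halves onto atime/mtime), the check a retransmitted EXCLUSIVE
create has to make looks like this; the point is that it must happen with
the inode still held:

    #include <string.h>
    #include <stdint.h>

    /* Simplified stand-in: the 8-byte verifier the client sent is parked
     * in the new file's atime and mtime. */
    struct inode { uint32_t i_atime, i_mtime; };

    static int create_verf_matches(const struct inode *inode,
                                   const unsigned char verf[8])
    {
        uint32_t v1, v2;

        memcpy(&v1, verf,     sizeof(v1));   /* how the halves map onto      */
        memcpy(&v2, verf + 4, sizeof(v2));   /* atime/mtime is made up here  */
        return inode->i_atime == v1 && inode->i_mtime == v2;
    }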

What this boils down to is that a user-space server cannot, without
violating the protocol spec, implement many of the advanced features
of NFSv3.

5. File locking over NFS

Supporting lockd in user space is close to impossible. I've tried it,
and ran into a large number of problems. Some of the highlights:

* lockd can provide only a limited number of locks at the same
  time because it has only a limited number of file descriptors.

* When lockd blocks a client's lock request because of a lock held
  by a local process on the server, it must continuously poll
  /proc/locks to see whether the request could be granted. What's
  more, if there's heavy contention for the file, it may take
  a long time before it succeeds because it cannot add itself
  to the inode's lock wait list in the kernel. That is, unless
  you want it to create a new thread just for blocking on this
  lock.

* Lockd must synchronize its file handle cache with that of
  the NFS servers. Unfortunately, lockd is also needed when
  running as an NFS client only, so you run into problems with
  who owns the file handle cache, and how to share it between
  these two services.

6. Conclusion

Alright, this has become rather long. Some of the problems I've described
above may be solvable with more or less effort, but I believe that, taken
as a whole, they make a pretty strong argument against sticking with
a user-space nfsd.

In kernel space, most of these issues can be addressed more easily, and more
efficiently. My current kernel nfsd is fairly small. Together with the
RPC core, which is used by both client and server, it takes up something
like 20 pages--don't quote me on the exact number. As mentioned above,
it is also pretty fast, and I hope I'll also be able to provide fully
functional file locking soon.

If you want to take a look at the current snapshot, it's available at
ftp.mathematik.th-darmstadt.de/pub/linux/okir/dontuse/linux-nfs-X.Y.tar.gz.
This version still has a bug in the nfsd readdir implementation, but
I'll release an updated (and fixed) version as soon as I have the necessary
lockd rewrite sorted out.

I would particularly welcome comments from the Keepers of the Source on
whether my NFS rewrite has any chance of being incorporated into the kernel
at some time... that would definitely motivate me to sink more time into
it than I currently do.

Happy hacking
Olaf

-- 
Olaf Kirch         |  --- o --- Nous sommes du soleil we love when we play
okir@monad.swb.de  |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax
             For my PGP public key, finger okir@brewhq.swb.de.