Re: NFS still has caching problems

Olaf Kirch (ok@daveg.com)
Thu, 18 Jul 1996 13:45:23 +0200


Hi all,

Firstly, I believe that the issue we're talking about is pathological in
a way, because NFS was never designed to provide cache consistency. These
problems can only be addressed properly in the context of a different
protocol, or maybe NFS file locking. Invalidating cached data based on
the file's mtime as soon as new attributes arrive is not the way to go
in NFSv2, IMHO.

First, let's assume we invalidate the cache unconditionally as soon as
the server's mtime changes. Who is going to be bitten by this? For one,
applications that access files in a read-write-read-write pattern: their
own writes change the mtime, so every write throws away the pages they
just read. Another problem occurs with the attributes returned by a read
NFS call; if we invalidate the cache in this case, we will not only
throw away the old cached pages, but also those just read, which *are*
up to date. Now multiply that by 4 because of the way we currently do
readahead, and the result is not pretty.
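
Just to make the failure mode explicit, the unconditional policy boils
down to something like the sketch below. nfs_check_attributes() and
nfs_cached_mtime() are made-up names for illustration; only
nfs_refresh_inode() and invalidate_inode_pages() correspond to the
routines used further down:

  /* Sketch of the "invalidate on every mtime change" policy.  The
   * function and nfs_cached_mtime() are made up for illustration.
   * Note that the attributes may have been piggybacked on a READ
   * reply, so the pages that very reply (and its readahead siblings)
   * just filled are thrown away along with everything else. */
  static void
  nfs_check_attributes(struct inode *inode, struct nfs_fattr *fattr)
  {
          if (fattr->mtime.seconds != nfs_cached_mtime(inode))
                  invalidate_inode_pages(inode);
          nfs_refresh_inode(inode, fattr);
  }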

Now, assume we try to be clever and cheat when receiving the server's
attributes from an operation that we know will change the server's
mtime. Linus already mentioned the race window that exists here: you
cannot assume that your cache is still valid just because _your_
operation changed only the file's metadata; an intervening operation
from another client could have modified the file contents. This is not
splitting hairs; remember, the whole problem is about someone else
writing to the file while we're accessing it. Besides, the stale data
would not merely survive for the duration of acregmax: once we've
updated NFS_OLDMTIME, future calls to revalidate_inode will not throw
away _any_ cached page until the file is changed again. NFSv3 tries to
eliminate this problem by providing pre- and post-operation attributes
in the NFS reply.
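
For comparison, the NFSv3 scheme boils down to something like the check
below. The wcc layout is paraphrased from the NFSv3 spec, and the
function is purely hypothetical; there is no such code in our client:

  /* Hypothetical sketch of weak cache consistency checking with
   * NFSv3 attributes: the reply to an operation carries the file's
   * attributes from before and after it was executed on the server. */
  static int
  nfs3_cache_still_valid(struct inode *inode, struct wcc_data *wcc)
  {
          /* If the pre-op mtime matches what we had cached, nothing
           * happened between our cached copy and our own operation,
           * so the cache can be kept; otherwise another client got
           * in between and the cached pages must go. */
          return wcc->before.mtime.seconds == NFS_OLDMTIME(inode);
  }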

The best solution I can see is to change the attribute caching (which
is also what BSD does, BTW). Currently, revalidate_inode does not do
a getattr if the last revalidation took place less than acregmax
jiffies ago. This can be changed, using heuristics along the following
lines:

* Add a new field to nfs_inode called acvalid, and initialize
it with acregmax (or acdirmax in the case of directories).

* Modify nfs_revalidate_inode in the following way:

  if (jiffies - NFS_READTIME(inode) < NFS_ACVALID(inode))
          return;
  if (getattr(inode, &fattr) == 0) {
          nfs_refresh_inode(inode, &fattr);
          if (NFS_OLDMTIME(inode) == fattr.mtime.seconds) {
                  /* unchanged: double the attribute cache
                   * timeout, capped at acregmax */
                  NFS_ACVALID(inode) = MIN(NFS_ACVALID(inode) << 1,
                                           server->acregmax);
                  return;
          }
          NFS_OLDMTIME(inode) = fattr.mtime.seconds;
  }
  invalidate_inode_pages(inode);

* If the attributes returned by a getattr or read call indicate
that the file's mtime has changed, set inode->acvalid to
MAX(inode->acvalid >> 1, server->acregmin); see the sketch after
this list. There's a small gotcha with respect to readahead here,
which could actually slash acvalid by 16 rather than 2, because 4
concurrent read operations halve acvalid before the next call to
revalidate_inode. But who has acregmax >= 16 * acregmin anyway?
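
To make the last point concrete, the shrinking half of the heuristic
could look roughly like this inside nfs_refresh_inode (again, acvalid,
NFS_ACVALID and the server pointer are the proposed names, not existing
code):

  /* Sketch only: on receiving attributes whose mtime differs from
   * what we have cached, shrink the attribute cache window again,
   * but never below acregmin.  "server" is the nfs_server of the
   * mount, as in the revalidation sketch above. */
  if (fattr->mtime.seconds != NFS_OLDMTIME(inode))
          NFS_ACVALID(inode) = MAX(NFS_ACVALID(inode) >> 1,
                                   server->acregmin);

Together with the doubling in nfs_revalidate_inode, the window becomes
adaptive: files nobody else touches drift towards acregmax, while files
under contention get checked again after roughly acregmin.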

These are just suggestions; comments welcome. I will also look into the
BSD code to see how they do it.

Cheers
Olaf

PS: Side note to Alex: Linux does not always track the server's mtime
in inode->i_mtime; utimes() will set it to the client's time. This is
actually a flaw in NFSv2 (in NFSv3 you can set the inode's time fields
to server time).