nfs client readdir caching issue?

From: Andy Chittenden
Date: Wed Jul 02 2008 - 07:28:16 EST


Very rarely, we see various problems on Linux kernel clients (on
several kernel versions) when running ls on NFS-mounted directories
whose contents haven't changed on the server:

* looping ls (strace -v shows getdents returning the same names over
again).
* duplicate directory entries.
* missing directory entries.
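
For anyone wanting to check a suspect directory without eyeballing ls
output, here is a minimal sketch in Python. The function name and the
`limit` cap are my own inventions for illustration; it only detects
duplicates and runaway (looping) listings, since spotting missing
entries needs a known-good list to compare against.

```python
import os
from collections import Counter


def duplicate_names(path, limit=100_000):
    """Return the sorted names that appear more than once when listing `path`.

    `limit` is a hypothetical safety cap so that a looping listing
    raises an error instead of running forever.
    """
    counts = Counter()
    with os.scandir(path) as entries:
        for n, entry in enumerate(entries):
            if n >= limit:
                raise RuntimeError("listing exceeded limit; possible readdir loop")
            counts[entry.name] += 1
    return sorted(name for name, count in counts.items() if count > 1)
```

On a healthy directory this returns an empty list.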

I've searched Google but can only find reports of problems where NFS
servers have returned duplicate cookies. I've packet captured the
readdirplus traffic for one of the directories and see no duplicate
cookies. The problems remain until the directory is touched, the NFS
filesystem is unmounted or some other event happens (the data is
flushed from the cache?).

I think we then got lucky and got two packet captures from different
clients running the same Linux kernel. On these clients, the ls output
was ok - no loops, no duplicates, no missing entries. Both captures
showed two readdirplus requests returning the same entries in the same
order, but the entries were split differently between the two
responses. One capture showed the server returning 1724 bytes, 10
entries, a last cookie of 12, followed by the next readdirplus
returning 948 bytes, 5 entries, a first cookie of 13. In the other
capture, the responses returned 2204 bytes, 13 entries, a last cookie
of 17, and then 468 bytes, 2 entries, a first cookie of 19.

In the past, ls has returned duplicate entries on this directory (we
didn't have a capture at the time). Those duplicate entries are exactly
the ones returned as the last 3 entries in the first response of the
second capture and as the first 3 entries in the second response of the
first capture.

So what I think has happened in this particular case is that at some
point in the past, the directory was read OK with packets similar to the
first capture. Next, the client got rid of the first page of cached
readdir responses for some reason (running low on memory?) but kept the
second page. Later, the readdir cache needed repopulating, so the client
sent a readdirplus with a cookie of 0, and this time it got a response
similar to the first packet of the second capture. The refilled first
page now overlaps the stale second page, so the cache holds duplicate
names and cookie values.
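
To make the suspected sequence concrete, here is a toy model in Python.
It is only a sketch of my hypothesis, not the kernel's actual readdir
cache code: the entry names e01..e15, the `readdirplus` helper and the
list-of-pages cache are all invented for illustration. Only the entry
counts per response (10 + 5 in one capture, 13 + 2 in the other) come
from the captures.

```python
# Hypothetical directory of 15 entries; the names are made up.
NAMES = [f"e{i:02d}" for i in range(1, 16)]


def readdirplus(start_index, max_entries):
    """Toy server: return up to max_entries names starting at start_index."""
    return NAMES[start_index:start_index + max_entries]


def simulate_partial_eviction():
    # Initial fill, split as in the first capture: 10 entries, then 5.
    pages = [readdirplus(0, 10), readdirplus(10, 5)]

    # Assumption: the client drops only the first cached page.
    pages[0] = None

    # Refill from cookie 0; this reply is split as in the second
    # capture, so the first page now holds 13 entries.
    pages[0] = readdirplus(0, 13)

    # A listing walks the refilled first page, then the stale second page.
    listing = [name for page in pages for name in page]
    duplicates = sorted({n for n in listing if listing.count(n) > 1})
    return listing, duplicates


listing, duplicates = simulate_partial_eviction()
print(duplicates)  # ['e11', 'e12', 'e13']
```

In this model the three duplicated names are the last 3 entries of the
13-entry response and the first 3 entries of the stale 5-entry page,
which matches the duplicates we saw from ls.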

So is this possible? Is there some easy way to provoke it? Does this
mean the client's readdir cache is broken?

Please cc me on any response.

--
Andy, BlueArc Engineering

