Re: [Cluster-devel] [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

From: Steven Whitehouse
Date: Fri Jul 25 2014 - 16:02:49 EST


Hi,

On 25/07/14 19:28, Zach Brown wrote:
On Fri, Jul 25, 2014 at 07:08:12PM +0100, Steven Whitehouse wrote:
Hi,

On 25/07/14 18:52, Zach Brown wrote:
[snip]
Hmm. Have you tried plumbing these read-ahead calls in under the normal
getdents() syscalls?

We don't have a filereadahead() syscall and yet we somehow manage to
implement buffered file data read-ahead :).

- z

Well, I'm not sure that's entirely true... we have readahead() and we also
have fadvise(FADV_WILLNEED) for that.
Sure, fair enough. It would have been more precise to say that buffered
file data readers see read-ahead without *having* to use a syscall.
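
For reference, the explicit hint mentioned above looks roughly like this from userspace. This is a minimal sketch using Python's os.posix_fadvise() wrapper; the helper name read_with_willneed and the chunk size are illustrative assumptions, not anything taken from the patches under discussion.

```python
import os


def read_with_willneed(path, chunk_size=1 << 16):
    """Read a whole file after hinting that its data will be needed soon."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            try:
                # offset=0, len=0 covers the file from the start to its end.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
            except OSError:
                # The hint is advisory only; carry on if it is refused.
                pass
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

The point of the exchange is that a plain read loop without the hint still benefits from read-ahead; the hint only lets the application ask for it earlier.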

doubt, but how would we tell getdents64() when we were going to read the
inodes, rather than just the file names?
How does transparent file read-ahead know how far to read-ahead, if at
all?
In the file readahead case it has some context, and that's stored in the struct file. That's where the problem lies in this case: the struct file relates to the directory, and when we then call open, or stat, or whatever on some file within that directory, we don't pass the directory's fd to that call, so we don't have a context to use. We could possibly look through the open fds of the process that called open to see whether the parent dir of the inode we are opening is among them, in order to find the context and decide whether to do readahead or not, but... it's not very nice, to say the least.
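
To make the "context" point concrete, here is a toy model of per-open-file readahead state. It is a sketch only, not the kernel's actual algorithm: the class name, the window sizes, and the doubling policy are all assumptions chosen for illustration.

```python
class FileReadaheadState:
    """Toy stand-in for the readahead context kept with each open file."""

    def __init__(self, initial_window=4, max_window=32):
        self.initial_window = initial_window
        self.max_window = max_window
        self.next_expected = 0  # page a sequential reader would ask for next
        self.window = 0         # pages to prefetch; 0 means readahead is off

    def on_read(self, page):
        """Record a read of `page` and return how many pages to prefetch."""
        if page == self.next_expected:
            # Sequential access: open the window, then double it up to the cap.
            grown = self.window * 2 if self.window else self.initial_window
            self.window = min(self.max_window, grown)
        else:
            # Random access: the history no longer predicts the next read.
            self.window = 0
        self.next_expected = page + 1
        return self.window
```

Every read() on the same fd feeds this state, which is what makes the guess possible. A stat() or open() by path on a directory entry has no equivalent object to update, because the directory's fd is not passed in.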

I'm very much in agreement that doing this automatically is best, but that only works when it's possible to get a very good estimate of whether the readahead is needed or not. That is much easier for file data than it is for inodes in a directory. If someone can figure out how to get around this problem though, then that is certainly something we'd like to look at.

The problem gets even more tricky in case the user only wants, say, half of the inodes in the directory... how does the kernel know which half?
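
As an illustration of the "which half" problem, consider an application that only stats entries matching a private rule. This is a hedged userspace sketch; the function name stat_selected and the predicate argument are hypothetical.

```python
import os


def stat_selected(dirpath, wanted):
    """Stat only the directory entries for which wanted(name) is true."""
    results = {}
    for name in sorted(os.listdir(dirpath)):   # first pass: names only
        if wanted(name):                       # policy known only to the app
            # Second pass: this is the per-inode lookup that readahead
            # would have to anticipate.
            results[name] = os.stat(os.path.join(dirpath, name))
    return results
```

For example, stat_selected(d, lambda n: n.endswith(".log")) touches only the .log inodes, and nothing in the sequence of directory reads tells the kernel that in advance.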

The idea here is really to give some idea of the kind of performance gains that we might see with the readahead vs xgetdents approaches and, judging by the sizes of the patches, of the relative complexity of the implementations.

I think overall, the readahead approach is the more flexible... if I had a directory full of files I wanted to truncate for example, it would be possible to use the same readahead to pull in the inodes quickly and then issue the truncates to the pre-cached inodes. That is something that would not be possible using xgetdents. Whether that's useful for real world applications or not remains to be seen, but it does show that it can handle more potential use cases than xgetdents. Also the ability to only readahead an application specific subset of inodes is a useful feature.
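
The truncate case might look something like the following from userspace. This is a sketch under stated assumptions: the prefetch argument is a hypothetical hook standing in for a directory readahead call, which Python's os module does not provide, so by default nothing is prefetched.

```python
import os
import stat


def truncate_all(dirpath, prefetch=None):
    """Truncate every regular file in dirpath to zero length.

    `prefetch` is a hypothetical hook that receives the open directory fd
    and is assumed to start pulling the directory's inodes into cache.
    """
    dir_fd = os.open(dirpath, os.O_RDONLY | os.O_DIRECTORY)
    try:
        if prefetch is not None:
            prefetch(dir_fd)  # assumed readahead step, issued once up front
        truncated = []
        for name in sorted(os.listdir(dirpath)):
            path = os.path.join(dirpath, name)
            # Skip subdirectories, symlinks and other non-regular entries.
            if stat.S_ISREG(os.lstat(path).st_mode):
                os.truncate(path, 0)
                truncated.append(name)
        return truncated
    finally:
        os.close(dir_fd)
```

The readahead step is independent of what is done to the inodes afterwards, which is the flexibility argument: the same hook would serve a stat loop, a truncate loop, or anything else.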

There is certainly a discussion to be had about how to specify the inodes that are wanted. Using the directory position is a relatively easy way to do it, and works well when most of the inodes in a directory are wanted. Specifying the file names would work better when fewer inodes are wanted, but then if very few are required, is readahead likely to give much of a gain anyway?... so that's why we chose the approach that we did.

How do the file systems that implement directory read-ahead today deal
with this?
I don't know of one that does. Or at least, readahead of the directory info itself is one thing (which is relatively easy, and done by many file systems); it's reading ahead the inodes within the directory which is more complex, and that is what we are talking about here.

Just playing devil's advocate here: It's not at all obvious that adding
more interfaces is necessary to get directory read-ahead working, given
our existing read-ahead implementations.

- z
That's perfectly ok - we hoped to generate some discussion, and they are good questions.

Steve.
