Re: [RFC PATCH 0/2] dirreadahead system call

From: Andreas Dilger
Date: Mon Jul 28 2014 - 17:19:44 EST

On Jul 28, 2014, at 6:52 AM, Abhijith Das <adas@xxxxxxxxxx> wrote:
> On July 26, 2014 12:27:19 AM "Andreas Dilger" <adilger@xxxxxxxxx> wrote:
>> Is there a time when this doesn't get called to prefetch entries in
>> readdir() order? It isn't clear to me what benefit there is of returning
>> the entries to userspace instead of just doing the statahead implicitly
>> in the kernel?
>> The Lustre client has had what we call "statahead" for a while,
>> and similar to regular file readahead it detects the sequential access
>> pattern for readdir() + stat() in readdir() order (taking into account if
>> ".*" entries are being processed or not) and starts fetching the inode
>> attributes asynchronously with a worker thread.
> Does this heuristic work well in practice? In the use case we were trying to
> address, a Samba server is aware beforehand if it is going to stat all the
> inodes in a directory.

Typically this works well for us, because this is done by the Lustre
client, so the statahead is hiding the network latency of the RPCs to
fetch attributes from the server. I imagine the same could be seen with
GFS2. I don't know if this approach would help very much for local
filesystems because the latency is low.
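
For readers unfamiliar with statahead, the detection described in the
quoted text can be sketched roughly as below. This is only an
illustration and not the Lustre code: the struct, the function names and
the threshold of 3 are all assumptions made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed value: in-order stat() calls needed before prefetch starts. */
#define STATAHEAD_THRESHOLD 3

/* Hypothetical per-directory state, not Lustre's real structure. */
struct statahead_state {
    size_t expected;  /* readdir() index the next stat() should hit */
    unsigned streak;  /* consecutive stat() calls seen in readdir() order */
};

/* Skip over ".*" entries when the caller is not processing them. */
static size_t next_visible(const char *const *names, size_t n,
                           size_t from, bool skip_hidden)
{
    while (skip_hidden && from < n && names[from][0] == '.')
        from++;
    return from;
}

/*
 * Record a stat() on entry `index` of `names` (listed in readdir() order).
 * Returns true once the access pattern looks sequential enough that a
 * worker thread could start fetching attributes ahead of the caller.
 */
static bool statahead_observe(struct statahead_state *s,
                              const char *const *names, size_t n,
                              size_t index, bool skip_hidden)
{
    size_t want = next_visible(names, n, s->expected, skip_hidden);

    if (index == want)
        s->streak++;
    else
        s->streak = 0;  /* out-of-order access resets the detector */

    s->expected = index + 1;
    return s->streak >= STATAHEAD_THRESHOLD;
}
```

With a listing of ".", "..", ".git", "a", "b", "c" and hidden entries
skipped, stat() calls on "a", "b" and "c" in that order would trip the
detector on the third call, and any out-of-order stat() resets it.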

>> This syscall might be more useful if userspace called readdir() to get
>> the dirents and then passed the kernel the list of inode numbers
>> to prefetch before starting on the stat() calls. That way, userspace
>> could generate an arbitrary list of inodes (e.g. names matching a
>> regexp) and the kernel doesn't need to guess if every inode is needed.
> Were you thinking arbitrary inodes across the filesystem or just a subset
> from a directory? Arbitrary inodes may potentially throw up locking issues.

I was thinking about inodes returned from readdir(), but the syscall
would be much more useful if it could handle arbitrary inodes. For
example, if directories are small then it may be more efficient to
aggregate inodes from multiple directories for each prefetch syscall.
I can't really think of any locking issues that could exist with
"arbitrary list of inodes" that couldn't be hit by having a directory
with hard links to the same list of inodes, so this is something that
needs to be handled by the filesystem anyway.

Since this would be an advisory syscall (i.e. it doesn't really
return anything and cannot guarantee that all the inodes will be in
memory), then if the filesystem is having trouble prefetching the
inodes (e.g. invalid inode number(s) or lock ordering or contention
issues) it could always bail out and leave it to stat() to actually
fetch the inodes into memory when accessed.

There is no way it would be sane to keep inodes locked in the kernel
after prefetch, in case the "stat" never arrives, so the best it can
do is cache the inodes in memory (on the client for network fs), and
it is possible cache pressure or lock contention drops them from cache.

There are always going to be race conditions even if limited to a
single directory (e.g. another client modifies the inode after calling
dirreadahead(), but before calling stat()) that need to be handled.

I think a generic syscall could bring a lot of benefits, possibly
similar to what XFS is doing with the "bulkstat" interfaces that Dave
always mentions. This would be even more true for cases where you
don't want to stat all of the entries in a directory.

> But yeah, as Steve mentioned in a previous email, limiting the inodes
> readahead in some fashion other than a range in readdir() order is
> something that we are thinking of (list of inodes based on regexps,
> filenames etc). We just chose to do an offset range of the directory
> for a quick, early implementation.

>> As it stands, this syscall doesn't help in anything other than readdir
>> order (or if the directory is small enough to be handled in one
>> syscall), which could be handled by the kernel internally already,
>> and it may fetch a considerable number of extra inodes from
>> disk if not every inode needs to be touched.
> The need for this syscall came up from a specific use case - Samba.
> I'm told that Windows clients like to stat every file in a directory
> as soon as it is read in and this has been a slow operation.

Cheers, Andreas
