Re: [RFC] readdirplus implementations: xgetdents vs dirreadahead syscalls

From: Dave Chinner
Date: Wed Jul 30 2014 - 23:16:20 EST

On Mon, Jul 28, 2014 at 03:21:20PM -0600, Andreas Dilger wrote:
> On Jul 25, 2014, at 6:38 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Fri, Jul 25, 2014 at 10:52:57AM -0700, Zach Brown wrote:
> >> On Fri, Jul 25, 2014 at 01:37:19PM -0400, Abhijith Das wrote:
> >>> Hi all,
> >>>
> >>> The topic of a readdirplus-like syscall had come up for discussion at last year's
> >>> LSF/MM collab summit. I wrote a couple of syscalls with their GFS2 implementations
> >>> to get at a directory's entries as well as stat() info on the individual inodes.
> >>> I'm presenting these patches and some early test results on a single-node GFS2
> >>> filesystem.
> >>>
> >>> 1. dirreadahead() - This patchset is very simple compared to the xgetdents() system
> >>> call below and scales very well for large directories in GFS2. dirreadahead() is
> >>> designed to be called prior to getdents+stat operations.
> >>
> >> Hmm. Have you tried plumbing these read-ahead calls in under the normal
> >> getdents() syscalls?
> >
> > The issue is not directory block readahead (which some filesystems
> > like XFS already have), but issuing inode readahead during the
> > getdents() syscall.
> >
> > It's the semi-random, interleaved inode IO that is being optimised
> > here (i.e. queued, ordered, issued, cached), not the directory
> > blocks themselves.
>
> Sure.
>
> > As such, why does this need to be done in the
> > kernel? This can all be done in userspace, and even hidden within
> > the readdir() or ftw()/nftw() implementations themselves so it's OS,
> > kernel and filesystem independent......
>
> That assumes sorting by inode number maps to sorting by disk order.
> That isn't always true.

That's true, but it's a fair bet that roughly ascending inode number
ordering is going to be better than random ordering for most
filesystems.

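As a rough illustration of that ordering done purely in userspace, here
is a minimal sketch in Python. The function names are made up for this
example, and it assumes the inode number reported in the directory entry
(d_ino) is a usable proxy for on-disk order, which, as noted above, is
not guaranteed on every filesystem.

```python
import os


def entries_by_inode(path):
    """Return the entry names in `path`, sorted by ascending inode number."""
    with os.scandir(path) as it:
        # DirEntry.inode() is taken from the directory entry itself on
        # Linux, so building the ordering needs no per-file stat().
        pairs = [(entry.inode(), entry.name) for entry in it]
    pairs.sort()
    return [name for _, name in pairs]


def stat_in_inode_order(path):
    """lstat() every entry in `path`, visiting them in inode order."""
    return {
        name: os.lstat(os.path.join(path, name))
        for name in entries_by_inode(path)
    }
```
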
Besides, ordering isn't the real problem - the real problem is the
latency caused by having to do the inode IO synchronously one stat()
at a time. Just multithread the damn thing in userspace so the
stat()s can be done asynchronously and hence be more optimally
ordered by the IO scheduler and completed before the application
blocks on the IO.

It doesn't even need completion synchronisation - the stat()
issued by the application will block until the async stat()
completes the process of bringing the inode into the kernel cache...
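
A minimal sketch of that approach, in Python for brevity. The names
prefetch_stats() and walk_with_prefetch() and the worker count are
assumptions made for this example, not an existing API. The worker
threads issue lstat() calls purely to warm the kernel's inode cache and
discard the results; the application then does its own lstat() calls
with no explicit synchronisation against the workers. CPython drops its
interpreter lock around the stat system call, so the calls can be in
flight concurrently.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def _warm(path):
    """lstat() one path and discard the result; only the caching matters."""
    try:
        os.lstat(path)
    except OSError:
        pass  # entry vanished or is unreadable; the real stat will report it


def prefetch_stats(path, workers=16):
    """Queue an lstat() for every entry in `path` on a pool of threads.

    Returns the pool so the caller can shut it down later.
    """
    pool = ThreadPoolExecutor(max_workers=workers)
    with os.scandir(path) as it:
        for entry in it:
            pool.submit(_warm, entry.path)
    return pool


def walk_with_prefetch(path):
    """Application-side loop: start the prefetch, then stat as usual."""
    pool = prefetch_stats(path)
    try:
        results = {}
        with os.scandir(path) as it:
            for entry in it:
                # No waiting on the workers here: this call either finds
                # the inode already cached or blocks in the kernel until
                # the inode has been read in.
                results[entry.name] = os.lstat(entry.path)
        return results
    finally:
        # Cleanup only; nothing above depends on the workers finishing.
        pool.shutdown(wait=False)
```

The same structure could sit inside a readdir() or nftw() wrapper, with
the pool size tuned to how much queue depth the storage can use.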


Dave Chinner