Re: [RFC][PATCHSET v3] non-recursive pathname resolution & RCU symlinks

From: Dave Chinner
Date: Fri May 15 2015 - 19:38:30 EST

On Thu, May 14, 2015 at 08:51:12AM -0700, Linus Torvalds wrote:
> On Thu, May 14, 2015 at 4:23 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > IIRC, ext4 readdir is not slow because of the use of the buffer
> > cache, it's slow because of the way it hashes dirents across blocks
> > on disk. i.e. it has locality issues, not a caching problem.
> No, you're just worrying about IO. Natural for a filesystem guy, but a
> lot of loads cache really well, and IO isn't an issue. Yes, there's a
> bad cold-cache case, but that's not when you get inode semaphore
> contention.

Right, because it's cold cache performance that everyone complains
about. e.g. Workloads like gluster, ceph, fileservers, openstack
(e.g. swift) etc are all mostly cold cache directory workloads with
*extremely high* concurrency. Nobody is complaining about cached
readdir performance - concurrency in cold cache directory operations
is what everyone has been asking me for.

In case you missed it, recently the Ceph developers have been
talking about storing file handles in a userspace database and then
using open_by_handle_at() so they can avoid the pain of cold cache
directory lookup overhead (see the O_NOMTIME thread). We have a
serious cold cache lookup problem on directories when people are
looking to bypass the directory structure entirely....

[snip a bunch of rhetoric lacking in technical merit]

> End result: readdir() wastes a *lot* of time on stupid stuff (just
> that physical block number lookup is generally more expensive than
> readdir itself should be), and it does so with excessive locking,
> serializing everything.

The most overhead in readdir is calling filldir over and over again
for every dirent to copy it into the user buffer. The overhead is
not from looking up the buffer in the cache.

So, I just created close to a million dirents in a directory, and
ran the xfs_io readdir command on it (look, a readdir performance
measurement tool!). I used a ram disk to take IO out of the picture
for the first read, the system has E5-4620 0 @ 2.20GHz CPUs, and I
dropped caches to ensure that there was no cached metadata:

$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
$ sudo xfs_io -c readdir /mnt/scratch
read 29545648 bytes from offset 0
28 MiB, 923327 ops, 0.0000 sec (111.011 MiB/sec and 3637694.9201 ops/sec)
$ sudo xfs_io -c readdir /mnt/scratch
read 29545648 bytes from offset 0
28 MiB, 923327 ops, 0.0000 sec (189.864 MiB/sec and 6221628.5056 ops/sec)
$ sudo xfs_io -c readdir /mnt/scratch
read 29545648 bytes from offset 0
28 MiB, 923327 ops, 0.0000 sec (190.156 MiB/sec and 6231201.6629 ops/sec)

Reading, decoding and copying dirents at 190MB/s? That's roughly 6
million dirents/second being pulled from cache, and it's doing
roughly 4 million/second cold cache. That's not slow at all.
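
For anyone wanting to repeat this kind of measurement without xfs_io,
here is a rough userspace sketch in Python. It is not how xfs_io
implements its readdir command; time_readdir() is a hypothetical
helper name, and os.scandir() adds interpreter overhead on top of the
underlying directory reads, so absolute numbers will be lower than
the ones above.

```python
import os
import time


def time_readdir(path):
    """Walk one directory once and report how fast entries came back.

    Returns (entries, seconds, entries_per_second). os.scandir() skips
    "." and "..", so the count can differ slightly from other tools.
    """
    start = time.perf_counter()
    entries = 0
    with os.scandir(path) as it:
        for _ in it:
            entries += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on tiny directories.
    rate = entries / elapsed if elapsed > 0 else float("inf")
    return entries, elapsed, rate


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    count, seconds, per_sec = time_readdir(target)
    print(f"{count} entries in {seconds:.4f} sec ({per_sec:.0f} entries/sec)")
```

As in the runs above, the first pass after dropping caches gives the
cold cache figure and the repeat passes give the hot cache figure.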

What *noticeable* performance gains are there to be had here for the
average user? Anything that takes less than a second or two to
complete is not going to be noticeable to a user, and most people
don't have 8-10 million inodes in a directory....
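
The arithmetic behind those figures can be checked from the xfs_io
output alone. A small sketch, assuming only the byte count, dirent
count and ops/sec values printed above (the elapsed time field prints
as 0.0000 sec, so the run times here are derived, not measured);
dirents_in() is a hypothetical helper:

```python
# Figures copied from the xfs_io output above.
TOTAL_BYTES = 29545648
TOTAL_DIRENTS = 923327
COLD_OPS_PER_SEC = 3637694.9201  # first run, after drop_caches
HOT_OPS_PER_SEC = 6231201.6629   # third run, fully cached

# Average size of one returned dirent.
bytes_per_dirent = TOTAL_BYTES / TOTAL_DIRENTS

# Derived wall-clock time for one full pass over the directory.
cold_seconds = TOTAL_DIRENTS / COLD_OPS_PER_SEC
hot_seconds = TOTAL_DIRENTS / HOT_OPS_PER_SEC


def dirents_in(seconds, ops_per_sec):
    """How many dirents can be read within a given time budget."""
    return int(seconds * ops_per_sec)


print(f"{bytes_per_dirent:.1f} bytes per dirent")
print(f"cold pass: {cold_seconds:.3f} sec, hot pass: {hot_seconds:.3f} sec")
print(f"2 sec budget, cold: {dirents_in(2, COLD_OPS_PER_SEC)} dirents")
print(f"2 sec budget, hot:  {dirents_in(2, HOT_OPS_PER_SEC)} dirents")
```

At these rates a two second budget covers somewhere between about 7
and 12 million dirents, which brackets the 8-10 million figure.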

So, what did the profile look like?

10.07% [kernel] [k] __xfs_dir3_data_check
9.92% [kernel] [k] copy_user_generic_string
7.44% [kernel] [k] xfs_dir_ino_validate
6.83% [kernel] [k] filldir
5.43% [kernel] [k] xfs_dir2_leaf_getdents
4.56% [kernel] [k] kallsyms_expand_symbol.constprop.1
4.38% [kernel] [k] _raw_spin_unlock_irqrestore
4.26% [kernel] [k] _raw_spin_unlock_irq
4.02% [kernel] [k] __memcpy
3.02% [kernel] [k] format_decode
2.36% [kernel] [k] xfs_dir2_data_entsize
2.28% [kernel] [k] vsnprintf
1.99% [kernel] [k] __do_softirq
1.93% [kernel] [k] xfs_dir2_data_get_ftype
1.88% [kernel] [k] number.isra.14
1.84% [kernel] [k] _xfs_buf_find
1.82% [kernel] [k] ___might_sleep
1.61% [kernel] [k] strnlen
1.49% [kernel] [k] queue_work_on
1.48% [kernel] [k] string.isra.4
1.21% [kernel] [k] __might_sleep

Oh, I'm running CONFIG_XFS_DEBUG=y, so internal runtime consistency
checks consume most of the CPU (__xfs_dir3_data_check,
xfs_dir_ino_validate). IOWs, real world readdir performance will be
much, much faster than I've demonstrated.

Other than that, the most CPU is spent on copying dirents into the
user buffer (copy_user_generic_string), passing dirents to the user
buffer (filldir) and extracting dirents from the on-disk buffer
(xfs_dir2_leaf_getdents). Then we have lock contention, ramdisk IO
(memcpy), some vsnprintf stuff (includes format_decode, probably
debug code) and some more dirent information extraction functions.

It's not until we get to _xfs_buf_find() that we see a buffer cache
lookup function, and that's actually consuming less CPU than the
__might_sleep/___might_sleep() debug annotations. That puts into
perspective just how little overhead readdir buffer caching
actually has compared to everything else.

IOWs, these numbers indicate that readdir caching overhead has no
real impact on the performance of hot cache readdir operations.
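
To make the comparison concrete, the profile entries discussed above
can be summed by category. This is a sketch only: the percentages are
copied from the profile, but the grouping is one reading of the
description above, and group_totals() is a hypothetical helper.

```python
# Percentages copied from the profile above (subset discussed in the text).
PROFILE = {
    "__xfs_dir3_data_check": 10.07,
    "copy_user_generic_string": 9.92,
    "xfs_dir_ino_validate": 7.44,
    "filldir": 6.83,
    "xfs_dir2_leaf_getdents": 5.43,
    "_xfs_buf_find": 1.84,
    "___might_sleep": 1.82,
    "__might_sleep": 1.21,
}

# Assumed grouping, following the prose description of each function.
GROUPS = {
    "debug consistency checks": ["__xfs_dir3_data_check", "xfs_dir_ino_validate"],
    "copying/extracting dirents": [
        "copy_user_generic_string",
        "filldir",
        "xfs_dir2_leaf_getdents",
    ],
    "might_sleep annotations": ["___might_sleep", "__might_sleep"],
    "buffer cache lookup": ["_xfs_buf_find"],
}


def group_totals(profile, groups):
    """Sum the CPU percentage of every symbol in each group."""
    return {
        name: round(sum(profile[sym] for sym in symbols), 2)
        for name, symbols in groups.items()
    }


if __name__ == "__main__":
    for name, pct in group_totals(PROFILE, GROUPS).items():
        print(f"{pct:6.2f}%  {name}")
```

With this grouping the buffer cache lookup accounts for 1.84% of the
samples, against roughly 3% for the might_sleep annotations alone.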

So, back to the question I asked that you didn't answer: exactly
what are you proposing to cache in the VFS readdir cache? Without
knowing that, I can't make any sane comment on the technical merit
of your proposal....

> Both readdir() and path component lookup are technically read
> operations, so why the hell do we use a mutex, rather than just
> get a read-write lock for reading? Yeah, it's that (d) above. I
> might trust xfs and ext4 to get their internal exclusions for
> allocations etc right when called concurrently for the same
> directory. But the others?

They just use a write lock for everything and *nothing changes* -
this is a simple problem to solve.
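
As a userspace illustration of that idea (a toy model, not kernel
code and not a proposed patch): every directory gets a read-write
lock, filesystems that are safe for concurrent readdir/lookup take it
shared, and everything else takes it exclusive, which behaves exactly
like the mutex they have today. RWLock, Directory and
lock_for_readdir() are all hypothetical names.

```python
import threading


class RWLock:
    """Minimal read-write lock: many readers or one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


class Directory:
    def __init__(self, concurrent_safe):
        self.lock = RWLock()
        # Set only by filesystems that handle concurrent readers themselves.
        self.concurrent_safe = concurrent_safe


def lock_for_readdir(directory):
    """Take the directory lock for a read operation; return the unlock call."""
    if directory.concurrent_safe:
        directory.lock.acquire_read()    # shared: readers run in parallel
        return directory.lock.release_read
    directory.lock.acquire_write()       # exclusive: same as the old mutex
    return directory.lock.release_write
```

A filesystem that has not been audited keeps concurrent_safe unset
and sees no behavioural change; only the ones that opt in get
concurrent readers.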

The argument "filesystem developers are stupid" is not a
compelling argument against changing locking. You're just being
insulting, even though you probably don't realise it.

[snip more rhetoric about the page cache being the only solution]

> Basically, in computer science, pretty much all performance work
> is about caching. And readdir is the one area where the VFS layer
> doesn't do well, falls on its face and punts back to the
> filesystem.

Caching is used to hide the problems of the lower layers. If the
lower layers don't have a problem, then another layer of caching is
not necessary.

Linus, what you haven't put together is a clear statement of the
problem another layer of readdir caching is going to solve. What
workload is having problems? Where are the profiles demonstrating
that readdir caching is the issue, or the solution to the issue you
are seeing? We know about plenty of workloads where directory access
concurrency is a real problem, but I'm not seeing the problem you
are trying to address...


Dave Chinner