Re: [RFC][PATCHSET v3] non-recursive pathname resolution & RCU symlinks

From: Linus Torvalds
Date: Thu May 14 2015 - 11:51:33 EST

On Thu, May 14, 2015 at 4:23 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> IIRC, ext4 readdir is not slow because of the use of the buffer
> cache, it's slow because of the way it hashes dirents across blocks
> on disk. i.e. it has locality issues, not a caching problem.

No, you're just worrying about IO. Natural for a filesystem guy, but a
lot of loads cache really well, and IO isn't an issue. Yes, there's a
bad cold-cache case, but that's not when you get inode semaphore
contention. You get contention when you have lots of concurrent
accesses to the same directory, and then the data is all nice and hot
in the caches. But readdir() _still_ sucks donkey ass by the
bucket-load for that case.

And that's the case I'm talking about. Using the buffer cache for
readdir() is a complete disaster, because it means that

(a) you have to go down to the filesystem, wasting CPU resources, and
more importantly, going into code that by definition hasn't been
optimized as well and cannot ever be, because it's not common code
that everybody sees.

(b) you have to look up the physical block number, wasting even
*more* CPU resources, because the buffer heads are physically indexed.

(c) you then use the buffer head lookup, which itself isn't horrible,
but it's not as well optimized as the page cache is.

(d) and because we call into the filesystem, not only is the code not
getting as much attention as the vfs layer, we generally can't trust
filesystem guys to get locking right (because 90% of the filesystems
don't get the attention they need even _without_ locking, and the 10%
that do are maintained by people who worry mainly about IO). So the
VFS layer has no real choice except to use a big-hammer "lock the
whole damn directory" approach.

End result: readdir() wastes a *lot* of time on stupid stuff (just
that physical block number lookup is generally more expensive than
readdir itself should be), and it does so with excessive locking,
serializing everything.
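
To make points (b) and (c) concrete, here is a toy userspace sketch of
the two lookup shapes. It is not kernel code: every name in it
(toy_dir, toy_fs_map_block, toy_physical_cache, and so on) is
hypothetical and exists only for illustration. It assumes a
physically indexed cache needs a filesystem-specific
logical-to-physical translation before it can be consulted, while a
logically indexed cache can be consulted directly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names are kernel APIs. */
#define TOY_NBLOCKS 8   /* logical blocks per directory */
#define TOY_NPHYS   64  /* physical blocks on the "device" */

struct toy_dir {
    /* filesystem-private mapping: logical block -> physical block */
    unsigned long extent_map[TOY_NBLOCKS];
    /* logically indexed cache: slot i holds logical block i */
    const char *logical_cache[TOY_NBLOCKS];
};

/* physically indexed cache, shared by everything on the device */
const char *toy_physical_cache[TOY_NPHYS];

/* counts how often we had to call down into the "filesystem" */
unsigned long toy_fs_calls;

/* stand-in for a filesystem's block-mapping routine */
static unsigned long toy_fs_map_block(const struct toy_dir *d, size_t logical)
{
    toy_fs_calls++;
    return d->extent_map[logical];
}

/* buffer-cache shape: translate first, then look up by physical block */
const char *lookup_physical(const struct toy_dir *d, size_t logical)
{
    unsigned long phys;

    if (logical >= TOY_NBLOCKS)
        return NULL;
    phys = toy_fs_map_block(d, logical);
    assert(phys < TOY_NPHYS);
    return toy_physical_cache[phys];
}

/* page-cache shape: look up by logical index, no filesystem call */
const char *lookup_logical(const struct toy_dir *d, size_t logical)
{
    if (logical >= TOY_NBLOCKS)
        return NULL;
    return d->logical_cache[logical];
}
```

Both functions return the same cached block, but only
lookup_physical() bumps toy_fs_calls. In the toy the translation is
one array read; in a real filesystem it is a walk of whatever mapping
structure that filesystem uses.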

Both readdir() and path component lookup are technically read
operations, so why the hell do we use a mutex, rather than just get a
read-write lock for reading? Yeah, it's that (d) above. I might trust
xfs and ext4 to get their internal exclusions for allocations etc
right when called concurrently for the same directory. But the others?

I saw you talk about how the aio IO paths are "better" than the
regular page cache paths just a few days ago (when talking about
persistent memory). You're completely and utterly out to lunch,
*especially* with things like persistent memory, where the IO paths
wouldn't even *exist*, because things never get out of the cache. And
being that out to lunch on this comes from your total fixation with IO. The
page cache is one studly mf in the normal cases when things are
cached, BUT YOU NEVER EVEN SEE THAT. Why? Because your filesystem code
never gets called for it, and the page cache ends up having almost
perfect behavior. It scales perfectly, and it scales with good

I understand where you are coming from, but caching really really
works. You ignore that, because you don't see those things, and the
caching case never affects you.

The readdir path? It sucks. And it sucks exactly because it's done in
the filesystem, and not in some VFS caches that we could actually make
go fast. We can't cache it well.

Basically, in computer science, pretty much all performance work is
about caching. And readdir is the one area where the VFS layer doesn't
do well, falls on its face and punts back to the filesystem.
