Re: [RFC][PATCHSET v3] non-recursive pathname resolution & RCU symlinks

From: NeilBrown
Date: Fri May 15 2015 - 21:25:22 EST


On Fri, 15 May 2015 17:45:56 -0700 Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Fri, May 15, 2015 at 4:30 PM, NeilBrown <neilb@xxxxxxx> wrote:
> >
> > .. and I've been wondering what to do about i_mutex and NFS. I've had
> > customer reports of slowness in creating files that seems to be due to
> > i_mutex on the directory being held over the whole 'create' RPC, so only one
> > of those can be in flight at the one time.
> > "make -j" on a large source directory can easily want to create lots of
> > "*.o" files at "the same time".
> >
> > And NFS doesn't need i_mutex at all because the server will provide the
> > needed guarantees.
>
> So i_mutex on a directory is probably the nastiest lock we have in the fs layer.
>
> It's used for several different half-related things:
>
> - serialize filename creation/deletion
>
> This is partly for the benefit of the filesystem itself (and not
> helpful for NFS, as you note), but it's also very much about making
> sure we have uniqueness guarantees at the VFS layer too.
>
> So even with NFS, it's not just "the server provides the needed
> guarantees", because some of the guarantees are really client-local.
>
> For example, simply that we only ever have one single dentry for a
> particular name, and that we only ever have one active lookup per
> dentry. Those things happen independently of - and before - the server
> even sees the operation.

But surely those things can be managed with a spinlock.

I think a big part of the problem is that the VFS tries to control
filesystems rather than provide services to them.
If NFS was in control it might:
- ask the dcache to look up a name and get back a dentry: positive, negative
or not-validated.
- if positive, NFS returns an error, or uses the inode - depending on
operation.
- otherwise send a 'create' request to the server. At this point it holds
references to dentries for directory and target, but no locks.
- If an error is returned, just drop references and return.
- On successful create, turn the filehandle into an inode and then
ask the dcache to link the inode with the target dentry.
If the dentry is still negative or not-validated, this is trivial.
If it is positive and already has the right inode, again trivial.
If it has the wrong inode, you certainly have an interesting problem,
but one that is specific to NFS (or similar filesystems) and one
that is up to NFS to solve, not up to the VFS to avoid.
If the syscall doesn't need to return an 'fd', then just drop the
references and report success.
If an 'fd' is required, then create a 'deleted' dentry attached to
the original parent. As 'mkdir' doesn't return an fd, that should
be safe (not that all the fuss about directory dentries having exactly one
parent is particularly relevant to NFS)

I'm not convinced that serialising 'lookup' calls is vital. If two threads
find a 'not-validated' dentry, and both try to look up the inode, they
will both ultimately get the same struct_inode from the icache, and will both
succeed in connecting it to the dentry. Obviously it would be better to
avoid two concurrent NFS "LOOKUP" requests, but that is a problem for NFS to
solve. I suspect that using d_fsdata to point to a pending LOOKUP request
would allow the "second" thread to wait for that request to finish. Other
filesystems would take a completely different approach.

But with the VFS trying to be in control and trying to accommodate the needs
of wildly different filesystems, I imagine it might not be so easy.

>
> So the whole local directory tree consistency ends up depending on this.
>
> - readdir(). This is mostly to make it hard for filesystems to do the
> wrong thing when there is concurrent file creation.

and makes it impossible for them to do the right thing too?


>
> I suspect readdir could fairly easily push the i_mutex down from the
> caller and into the filesystem, and then filesystems might narrow down
> the use (or even get rid of it). The initial patch might even be
> automated with coccinelle. However, rather few loads actually have a
> lot of readdir() activity, and samba is probably the only major one.
> I've seen benchmarks where it matters, but they are rare (and I
> haven't seen one in literally years).
>
> So the readdir case could probably be at least relaxed fairly easily.
> But the thing that tends to hurt on more loads is, as you note, the
> filename lookup/creation/movement case. And that's much harder to fix.
>
> Al, do you have any ideas? Personally, I've wanted to make I_mutex a
> rwsem for a long time, but right now pretty much everything uses it
> for exclusion. For example, filename lookup is clearly just reading
> the directory, so it should take a rwsem for reading, right? No. Not
> the way it is done now. Filename lookup wants the directory inode
> exclusively because that guarantees that we create just one dentry and
> call the filesystem ->lookup only once on that dentry.
>
> Again, there tend to be no simple benchmarks or loads that people care
> about that show this. Most of the time it's fairly hard to see.
>

Which "this"? Readdir? I haven't come across one (NFS READDIR issues are
all about when to use READDIR_PLUS and when not to).
Or create-contention-on-i_mutex? The load my customer had was 'make -j60' on
a many-cpu machine. I'm not sure if all the .o files were in the one
directory or if some logging was being recorded to multiple files in that one
directory, but the serialization of file-creation over NFS was measurable.
It wasn't horribly bad, but it was noticeably suboptimal. I think I
recommended creating whatever-it-was in multiple directories.

NeilBrown

Attachment: pgpRw_lp3JFgR.pgp
Description: OpenPGP digital signature