Re: [RFC][PATCHSET v3] non-recursive pathname resolution & RCU symlinks

From: NeilBrown
Date: Tue May 19 2015 - 04:33:47 EST


On Sun, 17 May 2015 19:56:26 -0700 Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Sun, May 17, 2015 at 4:16 PM, NeilBrown <neilb@xxxxxxx> wrote:
> >
> > Just to be crystal clear about what I want:
> > I want the filesystem to be in control
>
> Yeah, no. Not going to happen.
>
> You seem to think that the dcache is "just" a cache. It's not. It's a
> cache, but that is absolutely not all that it is. It's very much a
> cache with strong semantics.
>
> And no, we're not handing those semantics over to the filesystem.
> The dcache is not just a cache, it's the *primary* data structure that
> we use for pathname validation, local security checking, and for doing
> things like "getcwd()" and handling ".." etc.

A fact that makes it relatively easy to create a situation where 'getcwd()'
returns a string for which 'stat' says ENOENT, or where "cd .." puts you
somewhere that "getcwd" gets quite upset about:

$ cd ..
cd: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

$ ls -l /proc/self/cwd
lrwxrwxrwx 1 neilb users 0 May 19 17:28 /proc/self/cwd -> /mnt/tmp/cdir (deleted)

(and no: I hadn't deleted the cwd, just renamed some things on the server)
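
For anyone who wants to see the "(deleted)" state without an NFS server, here
is a minimal Python sketch. It is only a local analogue: it really does remove
the directory, whereas the case above involved nothing but renames on the
server. It assumes Linux (for /proc/self/cwd and for being allowed to rmdir
the cwd), and the function name is made up for illustration.

```python
import os
import shutil
import tempfile


def orphaned_cwd_demo():
    """Remove the cwd from under ourselves and report what is left behind."""
    original = os.getcwd()
    base = tempfile.mkdtemp()
    cdir = os.path.join(base, "cdir")
    os.mkdir(cdir)
    try:
        os.chdir(cdir)
        os.rmdir(cdir)  # assumption: Linux, which allows removing an empty cwd
        try:
            os.getcwd()
            getcwd_failed = False
        except FileNotFoundError:
            getcwd_failed = True
        # The name we used to get here no longer resolves.
        path_exists = os.path.exists(cdir)
        # /proc is Linux-specific; report None where it is unavailable.
        try:
            proc_cwd = os.readlink("/proc/self/cwd")
        except OSError:
            proc_cwd = None
        return getcwd_failed, path_exists, proc_cwd
    finally:
        os.chdir(original)
        shutil.rmtree(base, ignore_errors=True)


if __name__ == "__main__":
    print(orphaned_cwd_demo())
```

The process still has a perfectly usable cwd dentry throughout; it is only
the name for it that has gone away.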

>
> So there's no way the filesystem is "in control". You as a filesystem
> are not really even doing the actual pathname lookup. The *only* thing
> you're doing is filling in the dcache. The actual real pathname lookup
> is done by the VFS layer using the dcache data.
>
> That's how it very fundamentally works. It's *so* much more than a
> cache - it really *is* the primary path lookup. The filesystem is the
> slave in this relationship.

This requires the VFS to have knowledge, sometimes intimate knowledge, of how
each filesystem works.
DCACHE_NFSFS_RENAMED ?
Oh wait, afs and btrfs know about that too, so it can't be too intimate.

>
> > The filesystem then uses generic helpers (or not) to find the answers and adds
> > more current information to the cache.
>
> You can do that already. There *are* those generic helpers to add data
> to the cache. That's what "d_instantiate()" and friends _are_ for.
>
> But no, you do *not* control name lookup. You get notified when
> there's not enough data in the cache, and then you can fill it up any
> which way you want.
>
> You can populate the dcache with other entries than the one we asked
> for, and you can ask the dcache to revalidate and throw dentries out.
>
> But no, you do *not* get access to things like do_last() or to the
> decision to follow symlinks or namespace rules, or mountpoints or
> things like that.

Obviously the important rules that you mention would be handled by library
code. But do_last() could be a lot simpler if the filesystem could manage the
'stale dentry' handling and call one version or the other of do_last
depending on whether it had an 'atomic_open' callback or not.

>
> > So for Al's example of revalidating multiple components at once, once the VFS
> > gets to a point in the path where d_revalidate says "I need more time",
> > the VFS just passes the rest of the path to the filesystem.
>
> That's bullshit, for a very simple and basic reason: "the rest of the
> path" is not necessarily at all for your filesystem!

For revalidate: probably not, though the filesystem can ask questions of the
dcache just as easily as the VFS. For lookup, the rest of the path up to a
".." or symlink (which the filesystem can easily recognise) does belong to
the filesystem.

On this topic, Al suggested:

> With bulk revalidate covering
> all the chain when we stumble across .., mountpoint or something we believe
> to be a symlink, or when the chain reaches fs-specified limit.

That "fs-specified limit" is what really bothers me. This is feeding more
information about the fs into the VFS, and it assumes that a "limit" is the
thing that is meaningful for the VFS to know. Just let the FS take over and
use the approved interfaces to collect the dentries that it thinks might be
useful to revalidate, and then revalidate them.
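
To make the shape of Al's suggestion concrete, here is a rough Python sketch of
how a walk might be cut into chains, each handed to one bulk revalidate. It is
a toy under stated assumptions, not kernel code: the function name, the
is_boundary callback (standing in for "mountpoint or something we believe to
be a symlink") and the choice to leave the boundary component out of the chain
are all mine. Note that fs_limit is exactly the piece of per-filesystem
knowledge that the generic code would have to be told.

```python
def revalidate_chains(components, is_boundary, fs_limit):
    """Group path components into chains that would each get one bulk revalidate."""
    chains, current = [], []
    for name in components:
        if name == ".." or is_boundary(name):
            # A boundary ends the chain collected so far.
            if current:
                chains.append(current)
                current = []
            continue  # assumption: the boundary itself is left to the generic walk
        current.append(name)
        if len(current) == fs_limit:
            # The filesystem-specified limit also ends the chain.
            chains.append(current)
            current = []
    if current:
        chains.append(current)
    return chains


if __name__ == "__main__":
    path = ["a", "b", "c", "link", "d", "..", "e"]
    print(revalidate_chains(path, lambda n: n == "link", fs_limit=2))
```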


>
> Really. There might be mount-points, there might be symlinks, there
> might be tons of stuff like that.
>
> You're not getting control, for the very simple reason that IT IS NOT
> YOUR DATA. And it really never ever will be.
>
> Now, this is why I said we can do a "hint" style thing. Part of that
> "hint" issue is very very much that it has no semantic meaning. You
> can't screw it up, because if it turns out that the path component
> we're looking up is a symlink and we actually end up in some other
> filesystem, if you end up looking up the hint part, it just would
> never actually get used.
>
> So it's kind of like a prefetch for names. It's semantically much
> weaker than saying "look up this name". The hint would be "this is
> likely the next part of the name that the VFS layer will look up".
>
> And the key part of that statement is
> (a) "likely" (it might not happen, and even if it does happen, it
> might not be for your filesystem)
> and
> (b) "the VFS layer will look up" because it won't be the low-level
> filesystem doing it.
>
> So it would be the low-level filesystem pre-populating the dcache - if
> the low-level filesystem decides the hint is worth using for that -
> and the VFS layer then uses the data in the dcache without further
> bothering the filesystem.
>
> Exactly because the dcache is *so* much more than "just a cache".
>
> Linus
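
As a way of picturing the "hint" model described above, here is a small Python
toy. Everything in it is hypothetical: the dcache is a plain dict, ToyFS and
its lookup()/hint() methods are invented names, and there are no negative
dentries, locks, mounts or symlinks. The only point it illustrates is the
division of labour: the filesystem only ever adds entries to the cache, and
the walk itself is always resolved from the cache.

```python
class ToyFS:
    """Stands in for a low-level filesystem; it only ever fills the cache."""

    def __init__(self, tree):
        self.tree = tree  # {(parent_id, name): child_id}
        self.calls = 0    # how often the "VFS" had to bother the filesystem

    def lookup(self, dcache, parent, name):
        # Ordinary miss handling: fill in exactly the entry that was asked for.
        self.calls += 1
        child = self.tree.get((parent, name))
        if child is not None:
            dcache[(parent, name)] = child

    def hint(self, dcache, parent, names):
        # Prefetch: the remaining names are *likely* next, so populate what we can.
        self.calls += 1
        for name in names:
            child = self.tree.get((parent, name))
            if child is None:
                break
            dcache[(parent, name)] = child
            parent = child


def walk(fs, dcache, root, names, use_hint=False):
    """The 'VFS' walk: each step is answered from the dcache, never by the fs."""
    node = root
    for i, name in enumerate(names):
        if (node, name) not in dcache:
            if use_hint:
                fs.hint(dcache, node, names[i:])
            else:
                fs.lookup(dcache, node, name)
        node = dcache.get((node, name))
        if node is None:
            return None  # no such entry
    return node


if __name__ == "__main__":
    tree = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3}
    plain, hinted = ToyFS(tree), ToyFS(tree)
    print(walk(plain, {}, 0, ["a", "b", "c"]), plain.calls)
    print(walk(hinted, {}, 0, ["a", "b", "c"], use_hint=True), hinted.calls)
```

If the hinted entries turn out not to be needed, they simply sit unused in the
dict, which is the "no semantic meaning" property of the hint.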

Well, let's just start with that "in-lookup" or "unknown" dentry that has
been mentioned, so that the VFS doesn't have to hold i_mutex across lookup
and create, and so that the filesystem can at least control its own locking.

That would be a big step forward in my mind.

Thanks,
NeilBrown
