Re: [RFC][PATCHSET v3] non-recursive pathname resolution & RCU symlinks
From: NeilBrown
Date: Sun May 17 2015 - 00:48:26 EST
On Sat, 16 May 2015 21:04:34 -0700 Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Sat, May 16, 2015 at 8:48 PM, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Sorry, but that really is how it is. NFS isn't special enough for some
> > badly designed lookup models to matter one whit.
>
> Btw, it's not just about performance, although the whole "we can do
> cached lookups without ever having to get the filesystem involved" is a
> big deal.
>
> It's about getting fundamental concepts like mount points etc right,
> it's about all those things that the filesystem really doesn't
> know about, and _cannot_ sanely know about.
>
> It's now about things like overlayfs etc, all those things.
>
> So the filesystem really isn't in control. Never will be. The
> filesystem is at the mercy of (extended) unix semantics that are
> bigger than the filesystem.
>
> This is true of IO too. The filesystem does have a bit more
> flexibility, but in the end, you have to do the readpage thing,
> because it's the only way you'll get mmap. The filesystem isn't really
> in control there either, there are strict rules for what it has to do
> in order to have reasonable coherent mmap semantics.
>
> So the vfs layer often does have a "library" approach, because
> filesystems may do things in very different ways. But at the same
> time, the vfs layer really *is* in control, because it's the vfs layer
> that enforces certain basic semantics. So the dcache very much isn't
> just some "slave cache" that you choose to use and is at the control of
> the filesystem. Like the page cache, you don't get a choice, because
> you aren't in charge.
Last I checked, sysfs doesn't use the page cache.
Of course, sysfs is a special case (but then, aren't we all -- deep down).
I like the page cache. Really do. It provides useful services and lots of
'generic' helpers. You can use the 'generic' versions directly, or wrap them
in a little bit of extra code, or rewrite them completely. It's lovely.
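To show what I mean by the three options, here is a toy sketch in Python. It is not kernel code: generic_read, PlainFS, WrappingFS and SyntheticFS are all made-up names, and the "store" dict is only a stand-in for whatever cache the helper serves from.

```python
# Toy model only: none of these names are kernel APIs.

def generic_read(store, key):
    """Hypothetical 'generic' helper: serve the read from a shared cache."""
    return store.get(key, b"")


class PlainFS:
    # Option 1: use the generic helper directly.
    read = staticmethod(generic_read)


class WrappingFS:
    # Option 2: wrap the generic helper in a little extra code.
    def __init__(self):
        self.seen = []

    def read(self, store, key):
        self.seen.append(key)  # filesystem-specific bookkeeping
        return generic_read(store, key)


class SyntheticFS:
    # Option 3: rewrite it completely and never touch the cache.
    def read(self, store, key):
        return ("generated:" + key).encode()


store = {"a": b"cached data"}
plain, wrapping, synthetic = PlainFS(), WrappingFS(), SyntheticFS()
results = [fs.read(store, "a") for fs in (plain, wrapping, synthetic)]
```

The caller treats all three the same way; only the filesystem decides how much of the generic code it uses.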
>
> When somebody does a lookup of a filename, it is not a "pass this
> filename to the filesystem". It very much *is* a
> component-by-component lookup. And in the *vast* majority of the
> cases, the cached lookup when you don't even get asked is absolutely
> the right thing to do, and doing anything else wouldn't just be wrong,
> it would be completely and utterly stupid.
I think you must have been reading someone else's emails, not mine. I'm
totally there with the cached lookups. They are awesome. Don't want
anything else. But when the cache doesn't have the answer - what then? The
filesystem is most likely to know how to fill the cache most efficiently.
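A toy sketch of the split I have in mind, in Python rather than anything resembling real kernel code. ToyFS, walk and the dict used as a "dcache" are hypothetical names for illustration only; the point is just that the walk is component-by-component, the cache answers when it can, and the filesystem is asked only on a miss.

```python
# Toy model only: none of these names are kernel APIs.

class ToyFS:
    """Hypothetical filesystem: maps (parent_id, name) -> child_id."""

    def __init__(self, tree):
        self.tree = tree
        self.lookups = 0  # how often the "filesystem" was asked

    def lookup(self, parent, name):
        self.lookups += 1
        return self.tree.get((parent, name))


def walk(path, fs, dcache, root=0):
    """Resolve path one component at a time, consulting dcache first."""
    node = root
    for name in path.strip("/").split("/"):
        if not name:
            continue
        key = (node, name)
        if key not in dcache:
            # Cache miss: only now is the filesystem involved.
            dcache[key] = fs.lookup(node, name)
        node = dcache[key]
        if node is None:
            return None  # negative entry: component does not exist
    return node


fs = ToyFS({(0, "usr"): 1, (1, "lib"): 2, (1, "bin"): 3})
dcache = {}
first = walk("/usr/lib", fs, dcache)   # two misses, filesystem asked twice
after_first = fs.lookups
second = walk("/usr/lib", fs, dcache)  # fully cached, filesystem not asked
after_second = fs.lookups
```

In this sketch the miss path asks for exactly one component at a time. Whether the filesystem should be allowed to fill in more than that on a miss is the question I am raising.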
I remember hunting after some problem a while ago. I don't remember the
exact details but it was related to when NFS is asked to perform permission
checks on the way to opening something. I'm pretty sure it involved
atomic_open() as a key part.
Anyway, the code is/was very hairy and seemed to be convoluted in order to
try to meet everybody's needs at once.
Having one piece of code that tries to handle the subtle details for all
filesystems is, I think, a mistake. Certainly have a block of code that does
the 'easy, local filesystem' version. But don't try to combine the
necessarily-different NFS version into the same block of code. It becomes
(nearly) unreadable.
I know there are interesting complex cases for open: O_EXCL and trailing
symlinks and things certainly make it interesting. But pretending that all
filesystems can be squashed into the one mould is just a pretence.
NeilBrown
>
> And the fact that somebody doesn't understand that, and has designed
> bad extensions to do multi-component lookup, isn't actually an
> argument against the dcache. It's just an argument for "people make
> bad interfaces because they hack things up and don't understand
> things".
>
> Linus