Re: [PATCH 0/5] hybrid union filesystem prototype

From: Neil Brown
Date: Fri Aug 27 2010 - 07:35:31 EST


On Fri, 27 Aug 2010 10:47:39 +0200
Miklos Szeredi <miklos@xxxxxxxxxx> wrote:

> Hi Neil,
>
> On Fri, 27 Aug 2010, Neil Brown wrote:
> > My first problem with this that there isn't nearly enough documentation.
> > So I offer the following to fix this problem. Please correct anything that I
> > got glaringly wrong. I don't claim it is at all complete, but touches on the
> > things that I thought were interesting.
>
> Whoa, thank you very much. I'll add it to the patchset with some
> fixes (see below).
>
> > Hybrid Union Filesystem
> > =======================
> >
> > This document describes a prototype for a new approach to providing
> > union-filesystem functionality in Linux.
> > A union-filesystem tries to present the union of two different filesystems as
> > though it were a single filesystem. The result will inevitably fail to look
> > exactly like a normal filesystem for various technical reasons. The
> > expectation is that many use cases will be able to ignore these differences.
> >
> > This approach is 'hybrid' because the objects that appear in the filesystem
> > do not all appear to belong to that filesystem. In many case an object
> > accessed in the hybrid-union will be indistinguishable from accessing the
> > corresponding object from the original filesystem. This is most obvious
> > from the 'st_dev' field returned by stat(2). Some objects will report an
> > st_dev from one original filesystem, some from the other, none will report an
> > st_dev from the union itself.
>
> Hmm, that's a bug. Directories actually come from the union itself
> for various reasons, and it does report the correct st_ino but not
> st_dev. Fixed.

I was wondering why you even bothered to fiddle st_ino. Given that lots of
other things come from one fs or the other, having the merged directories
appear to be from the upper filesystem doesn't seem like a problem. I don't
really care either way though.

> >
> > The lower filesystem can be any filesystem supported by Linux and does not
> > need to be writable. It could even be another union.
>
> Almost. Xattr namespace issues currently prevent that, but with
> escaping it could be worked around if necessary.

But you never access the xattrs for the lower directory so that shouldn't
matter.
Have a union for the upper fs would certainly be sufficiently 'interesting'
to explicitly forbid.

> >
> > Changes to underlying filesystems
> > ---------------------------------
> >
>
> For now I refuse to even think about what happens in this case.
>
> The easiest way out of this mess might simply be to enforce exclusive
> modification to the underlying filesystems on a local level, same as
> the union mount strategy. For NFS and other remote filesystems we
> either
>
> a) add some way to enforce it,
> b) live with the consequences if not enforced on the system level, or
> c) disallow them to be part of the union.
>

I actually think that your approach can work quite will with either the
upper or lower changing independently. Certainly it can produce some odd
situations, but even NFS can do that (though maybe not quite so odd).

It think the best approach would be to handle the few that can be handled and
explicitly document the rest - people might even find them useful.


Anyway, here is my next instalment which are the review comments, now that I
have some documentation to read :-)

In no particular order:

1/ You call union_remove_whiteouts on an upper directory once you have
determined that the merged directory is empty, which implies that the only
things in the the upper directory are whiteouts.
union_remove_whiteouts calls union_unlink_writeout on every entry.
It checks that each entry is a DT_LNK - which we assume it must be - but
doesn't check the the inode there is really a whiteout.
It seem inconsistent to do the one check but not the other.

As this could race with independent changes on the upper filesystem, I
thick it would be safest to at least check the i_mode and i_size of the
dentry->d_inode that is found, and possible do a readlink as well to
ensure we only delete whiteouts.

1a/ A minor optimisation for union_is_white would be to check i_size matches
size of (union-whiteout)

2/ It bothers me that the 'dev_name' arg to union_get_sb is unused, yet there
are mandatory options. I think it would be nice if dev_name were required
to be lower,upper and options were left for other things.
So the typical usage would be:

mount -t union /path/to/lower,/path/to/upper /mount/point

3/ I think it would be great if you made use of d_revalidate to handle some
of the worst problems caused by underlying filesystem changes. If you
cache i_mtime and i_version of the parent in the dentry and re-do the
lookup if either is different from the directory you could hide most of
the issues mentioned in the doco about dentries expiring. And if you
called d_revalidate in the underlying filesystem at the same time you could
probably hide all the rest.

4/ The problem you mentioned about ->i_mutex which is taken during unlink
being quasi-global could be eased somewhat by simply having a small pool
of inodes and assigning them to dentries pseudo-randomly. Or possibly
having one 'file' inode per directory (that might be a bit wasteful
though).

5/ I wonder if it would be useful to recognise 'fallthrough' objects (much
like whiteouts but inverted) and to optionally (based on mount option) copy
up any directory on readdir and fill it with fall-throughs for any lower
name that isn't in the upper. This helps with enormous directories
(though not if upper is RAMFS of course) and ensures a stable directory
cookie.


While I like the idea of being able to work with changeable filesystems and
think most scenarios can be handled adequately, there is one that I'm not
comfortable with.
If you open a file readonly and get a handle on the file in the lower
filesystem, and then you fchmod the handle, it will work but will change the
lower filesystem - which you don't really want (I think).
The only way to avoid this currently is to require the lower vfsmnt to be
mounted readonly. That is probably acceptable (it can be mounted read-write
elsewhere) but I'd like it to be easier than that.
An alternate would be to teach mnt_want_write_file about some new flag which
marks the 'struct file' as 'really-readonly' and have union_open set that
flag. Note sure if I really like that or not.
Probably for now just:

6/ require __mnt_is_readonly(lowerpath.mnt) at mount time.

Finally, I think it is really important to document all the non-standard
aspects of the unioned filesystem in some detail and suggest work-around for
potentially problematic behaviour.
The fact that a read-only open can get a lower-filesystem filehandle has a
lot of interesting consequences.
If you take a read-lock, it doesn't stop an other process writing to the file
(though it does stop the file from changing!).
If you request a DNOTIFY on a directory, you will not see when things get
added.
If you take a lease on a file it won't be broken by someone trying to write.

So if you:
- create a union.
- start a daemon that watches for changes
- make changes

The daemon could never notice. Of course if you stop everything and restart
it will then work smoothly, as the 'make-changes' will have copied-up the
directory.

Most such confusion can be easily avoided by 'touching' any file or directory
that will be monitored for changes or will be involved in locking. Having to
do that after creating a union is a hassle, but only a minor hassle I think,
and may not actually be needed at all in many cases.

On the whole, I really like this approach. It strikes a good balance between
simplicity and functionality. It doesn't provide perfect semantics, but I
think it provides good-enough semantics and a very moderate cost.

Thanks,
NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/