Re: [PATCH v2 0/6] Composefs: an opportunistically sharing verified image filesystem

From: Christian Brauner
Date: Tue Jan 17 2023 - 10:29:00 EST


On Tue, Jan 17, 2023 at 02:56:56PM +0100, Giuseppe Scrivano wrote:
> Christian Brauner <brauner@xxxxxxxxxx> writes:
>
> > On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote:
> >> > It seems rather another an incomplete EROFS from several points
> >> > of view. Also see:
> >> > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@xxxxxxxxxxxxx/T/#u
> >> >
> >>
> >> Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19,
> >> where the community reactions rhyme with the reactions to composefs.
> >> The discussion on Incremental FS resembles composefs case even more [1].
> >> AFAIK, Android is still maintaining Incremental FS out-of-tree.
> >>
> >> Alexander and Giuseppe,
> >>
> >> I'd like to join Gao is saying that I think it is in the best interest
> >> of everyone,
> >> composefs developers and prospect users included,
> >> if the composefs requirements would drive improvement to existing
> >> kernel subsystems rather than adding a custom filesystem driver
> >> that partly duplicates other subsystems.
> >>
> >> Especially so, when the modifications to existing components
> >> (erofs and overlayfs) appear to be relatively minor and the maintainer
> >> of erofs is receptive to new features and happy to collaborate with you.
> >>
> >> w.r.t overlayfs, I am not even sure that anything needs to be modified
> >> in the driver.
> >> overlayfs already supports "metacopy" feature which means that an upper layer
> >> could be composed in a way that the file content would be read from an arbitrary
> >> path in lower fs, e.g. objects/cc/XXX.
> >>
> >> I gave a talk on LPC a few years back about overlayfs and container images [2].
> >> The emphasis was that overlayfs driver supports many new features, but userland
> >> tools for building advanced overlayfs images based on those new features are
> >> nowhere to be found.
> >>
> >> I may be wrong, but it looks to me like composefs could potentially
> >> fill this void,
> >> without having to modify the overlayfs driver at all, or maybe just a
> >> little bit.
> >> Please start a discussion with overlayfs developers about missing driver
> >> features if you have any.
> >
> > Surprising that I and others weren't Cced on this given that we had a
> > meeting with the main developers and a few others where we had said the
> > same thing. I hadn't followed this.
>
> well that wasn't done on purpose, sorry for that.

I understand. I was just surprised given that I very much work on the
vfs on a day to day basis.

>
> After our meeting, I thought it was clear that we have different needs
> for our use cases and that we were going to submit composefs upstream,
> as we did, to gather some feedbacks from the wider community.
>
> Of course we looked at overlay before we decided to upstream composefs.
>
> Some of the use cases we have in mind are not easily doable, some others
> are not possible at all. metacopy is a good starting point, but from
> user space it works quite differently than what we can do with
> composefs.
>
> Let's assume we have a git like repository with a bunch of files stored
> by their checksum and that they can be shared among different containers.
>
> Using the overlayfs model:
>
> 1) We need to create the final image layout, either using reflinks or
> hardlinks:
>
> - reflinks: we can reflect a correct st_nlink value for the inode but we
> lose page cache sharing.
>
> - hardlinks: make the st_nlink bogus. Another problem is that overlay
> expects the lower layer to never change and now st_nlink can change
> for files in other lower layers.
>
> These operations have a cost. Even if we all the files are already
> available locally, we still need at least one operation per file to
> create it, and more than one if we start tweaking the inode metadata.

Which you now encode in a manifest file which changes properties on a
per file basis without any vfs involvement which makes me pretty uneasy.

If you combine overlayfs with idmapped mounts you can already change
ownership on a fairly granular basis.

If you need additional per file ownership use overlayfs which gives you
the ability to change file attributes on a per file per container basis.

>
> 2) no multi repo support:
>
> Both reflinks and hardlinks do not work across mount points, so we

Just fwiw, afaict reflinks work across mount points since at least 5.18.

> cannot have images that span multiple file systems; one common use case
> is to have a network file system to share some images/files and be able
> to use files from there when they are available.
>
> At the moment we deduplicate entire layers, and with overlay we can do
> something like the following without problems:
>
> # mount overlay -t overlay -olowerdir=/first/disk/layer1:/second/disk/layer2
>
> but this won't work with the files granularity we are looking at. So in
> this case we need to do a full copy of the files that are not on the
> same file system.
>
> 3) no support for fs-verity. No idea how overlay could ever support it,
> it doesn't fit there. If we want this feature we need to look at
> another RO file system.
>
> We looked at EROFS since it is already upstream but it is quite
> different than what we are doing as Alex already pointed out.
>
> Sure we could bloat EROFS and add all the new features there, after all
> composefs is quite simple, but I don't see how this is any cleaner than
> having a simple file system that does just one thing.
>
> On top of what was already said: I wish at some point we can do all of
> this from a user namespace. That is the main reason for having an easy
> on-disk format for composefs. This seems much more difficult to achieve

I'm pretty skeptical of this plan whether we should add more filesystems
that are mountable by unprivileged users. FUSE and Overlayfs are
adventurous enough and they don't have their own on-disk format. The
track record of bugs exploitable due to userns isn't making this
very attractive.