Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

From: Josh Triplett
Date: Wed Oct 07 2020 - 16:14:36 EST


On Wed, Oct 07, 2020 at 10:32:11AM -0400, Theodore Y. Ts'o wrote:
> On Wed, Oct 07, 2020 at 01:03:04AM -0700, Josh Triplett wrote:
> > > But can we *please* take your custom tool out back and shoot it in the
> > > head?
> >
> > Nope. As mentioned, this isn't about creating ext4 filesystem images,
> > and it isn't even remotely similar to mke2fs.
>
> Can you please tell us what your tool is for, then? Why are you doing
> this? Why are you inflicting this on us?

That sounds like a conversation that would have been a lot more
interesting and enjoyable if it hadn't started with "can we shoot it in
the head", and continued with the notion that anything other than
e2fsprogs making something to be mounted by mount(2) and handled by
fs/ext4 is being "inflicted", and if the goal didn't still seem to be
"how do we make it go away so that only e2fsprogs and the kernel ever
touch ext4". I started this thread because I'd written some userspace
code, a new version of the kernel made that userspace code stop working,
so I wanted to report that the moment I'd discovered that, along with a
potential way to address it with as little disruption to ext4 as
possible.

I'm not looking to be an alternative userspace, or an alternative
anything; that implies serving more-or-less the same function
differently. I have no interest in supplanting mke2fs or any other part
of e2fsprogs; I'm using those for many of the purposes they already
serve.

The short version is that I needed a library to rapidly turn
dynamically-obtained data into a set of disk blocks to be served
on-the-fly as a software-defined disk, and then mounted on the other
side of that interface by the Linux kernel. Turns out that's *many
orders of magnitude* faster than any kind of network filesystem like
NFS. It's slightly similar to a vvfat for ext4. The fewer blocks it has to
generate and account for and cache, the faster it can run, and
microseconds matter.

ext4 has *incredible* compatibility guarantees. I'd already understood
the whole compat/ro_compat mechanism when I read through the on-disk
format documentation and the code. RO_COMPAT_SHARED_BLOCKS *seemed* like
the right semantic description of "don't ever try to write to this
filesystem because there are deduplicated blocks", and
RO_COMPAT_READONLY seemed like the right semantic description for "don't
ever try to write this, it's permanently read-only for unspecified
reasons".
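
For illustration only, here is a minimal sketch of how a userspace
reader could check for those two flags in an image. It assumes the
usual ext4 on-disk layout (superblock at byte 1024, s_magic at offset
0x38, s_feature_ro_compat at offset 0x64, little-endian) and the flag
values 0x1000 and 0x4000; the helper names are made up for this sketch
and are not taken from the tool discussed in this thread.

```python
import struct

# Assumed constants from the ext4 on-disk format documentation.
SUPERBLOCK_OFFSET = 1024          # primary superblock starts at byte 1024
S_MAGIC_OFFSET = 0x38             # __le16 s_magic
S_FEATURE_RO_COMPAT_OFFSET = 0x64 # __le32 s_feature_ro_compat
EXT4_MAGIC = 0xEF53

RO_COMPAT_READONLY = 0x1000
RO_COMPAT_SHARED_BLOCKS = 0x4000


def read_ro_compat(image: bytes) -> int:
    """Return the s_feature_ro_compat word from a filesystem image."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(sb) < S_FEATURE_RO_COMPAT_OFFSET + 4:
        raise ValueError("image too short to contain a superblock")
    (magic,) = struct.unpack_from("<H", sb, S_MAGIC_OFFSET)
    if magic != EXT4_MAGIC:
        raise ValueError("bad superblock magic")
    (ro_compat,) = struct.unpack_from("<I", sb, S_FEATURE_RO_COMPAT_OFFSET)
    return ro_compat


def must_stay_read_only(ro_compat: int) -> bool:
    """True if either flag that forbids writing is set."""
    return bool(ro_compat & (RO_COMPAT_READONLY | RO_COMPAT_SHARED_BLOCKS))
```

Calling must_stay_read_only(read_ro_compat(data)) on the first 2 KiB of
an image is enough for this check; nothing here looks at bitmaps or
group descriptors.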

If those aren't the right way to express that, I could potentially
adapt. I had a similar conversation on linux-ext4 already (about
inline data with 128-byte inodes), which led to me choosing to abandon
128-byte inodes rather than try to get ext4 to support what I wanted
with them, because I didn't want to be disruptive to ext4 for a niche
use case. In the particular case that motivated this thread, what I was
doing already worked in previous kernels, and it seemed reasonable to
ask for it to continue to work in new kernels, while preserving the
newly added checks in the new kernels.

If the response here had been more along the lines of "could we create
and use a *different* compat flag for shared metadata and have
RO_COMPAT_SHARED_BLOCKS only mean shared data blocks", I'd be fine with
that.

If at some point I'm looking to make ext4 support more than it already
does (e.g. a way to omit bitmaps entirely, or a way to express
contiguous files with smaller extent maps, or other enhancements for
read-only filesystems), I'd be coming with architectural discussions
first, patches second, and at no point would I have the expectation that
ext4 folks need to do extra work on my behalf. I'm happy to do the work.
The *only* thing I'm asking, here, is "don't break things that worked".
And after this particular item, I'd be happy to narrow that to "don't
break things that e2fsck was previously happy with".

- Josh Triplett