Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

From: Josh Triplett
Date: Thu Oct 08 2020 - 18:23:12 EST


On Wed, Oct 07, 2020 at 10:10:17PM -0400, Theodore Y. Ts'o wrote:
> On Wed, Oct 07, 2020 at 01:14:24PM -0700, Josh Triplett wrote:
> > That sounds like a conversation that would have been a lot more
> > interesting and enjoyable if it hadn't started with "can we shoot it in
> > the head", and continued with the notion that anything other than
> > e2fsprogs making something to be mounted by mount(2) and handled by
> > fs/ext4 is being "inflicted", and if the goal didn't still seem to be
> > "how do we make it go away so that only e2fsprogs and the kernel ever
> > touch ext4". I started this thread because I'd written some userspace
> > code, a new version of the kernel made that userspace code stop working,
> > so I wanted to report that the moment I'd discovered that, along with a
> > potential way to address it with as little disruption to ext4 as
> > possible.
>
> What is really getting my dander up is your attempt to claim that the
> on-disk file system format is like the userspace/kernel interface,
> where if we break any file system that was "previously accepted by an
> older kernel", this is a bug that must be
> reverted or otherwise fixed to allow file systems that had previously
> worked, to continue to work. And this is true even if the file system
> is ***invalid***.
>
> And the problem with this is that there have been any number of
> commits where file systems which were previously invalid, but which
> could be caused to trigger a syzbot whine, which was fixed by
> tightening up the validity tests in the kernel. In some cases, I also
> had to fix up e2fsck to detect the invalid file system which
> was generated by the file system fuzzer. Yes, it's unfortunate that
> we didn't have these checks earlier, but a file system has a huge
> amount of state.
>
> The principle you've articulated would make it impossible for me to
> fix these bugs, unless I can prove that the failure to check a
> particular invalid file system corruption could lead to a security
> vulnerability. (Would it be OK for me to make the kernel more strict
> and reject an invalid file system if it triggers a WARN_ON, so I get
> the syzbot complaint, but it doesn't actually cause a security issue?)
>
> So this conversation would have been a lot more pleasant for *me* if
> you hadn't tried to elevate your request to a general principle, where
> if someone is deliberately generating an invalid file system, I'm not
> allowed to make the kernel more strict to detect said invalidity and
> to reject the invalid / corrupted / fuzzed file system.

I appreciate you providing this explanation; this makes the nature and
severity of your concern *much* more clear. I also now understand why I
was feeling confused, and why it felt several times like we were talking
past each other; see below. For my part, I had no desire to have
generated this level of concern. The restrictions you're describing
above go far, far beyond anything I had envisioned or desired. I want
to clarify a couple of things.

I wasn't trying to establish a *new* general principle or policy. I was under
the impression that this *was* the policy, because it never occurred to
me that it could be otherwise. It seemed like a natural aspect of the
kernel/userspace boundary, to the point that the idea of it *not* being
part of the kernel's stability guarantees didn't cross my mind. Thus,
when you were suggesting a policy of "only filesystems generated by the
e2fsprogs userspace tools are acceptable", that sounded to me like a
massive loosening of the existing policy. And when I suggested
an alternative of "only filesystems that pass e2fsck are acceptable", I
was also trying to suggest a loosening of the existing policy, just a
different, milder one. I would have come to this discussion with a very
different perspective if it had occurred to me that this might not be
the policy, or at least that it simply hadn't come up in this way
before.

I would be quite happy to have a discussion about where that stability
boundary should be. I also have no desire to be inflexible about that
boundary or about the evolution of the software I'm working on; this is
not some enterprise "thou shalt not touch" workload. On the contrary,
within reason it's *relatively* straightforward for me to evolve
filesystem generation code to deal with changes in the kernel or e2fsck.
I was not looking to push some new general principle on stability; I
only wished to ensure that it was reasonable to negotiate about the best
way to solve problems if and when they arise, and I hope they do not
arise often. (Most of the time, I'd expect that when they arise I'll
end up wanting to change my software rather than the kernel.)

I also don't think that a (milder) stability guarantee would mean that
the kernel or fsck can never become stricter. On the contrary, they both
*should* absolutely try to detect more issues over time. Much like other
instances of the userspace/kernel boundary, changes that could
hypothetically break some userspace software aren't automatically
disqualified, only changes that *actually* break userspace in practice.
I regularly see the kernel make changes that would technically affect
some userspace software, but where either no such software appears to
exist (and the question could be revisited if it did), or any such
software that does exist is not affected in a harmful way (e.g. the
software copes with any error produced), or the software can be
relatively easily patched to work around the new requirements and the
nature of the software is such that people won't be running older
versions, or any number of other practical considerations apply.

Stability is a practical boundary rather than a theoretical concern.
It's one that can be discussed and negotiated based on the nature of a
change, and its practical impact. I suggested a kernel patch because
that seemed like a straightforward approach with minimal impact on the
kernel's validity checking. Darrick's suggestion in the next mail is an
even better approach, and one I'm likely to use, so I'd have no problem
with the patch that started this thread *not* being applied after all
(though I appreciate the willingness to consider it).

Finally, I think some clarification is also needed on what some of the
incompat and ro_compat flags mean. For instance, "RO_COMPAT_READONLY"
is documented as:
> - Read-only filesystem image; the kernel will not mount this image
> read-write and most tools will refuse to write to the image.
Is it reasonable to interpret this as "this can never, ever become
writable", such that no kernel should ever "understand" that flag in
ro_compat? I'd assumed so, but this discussion is definitely leading me
to want to confirm any such assumptions. Is this a flag that e2fsck
could potentially use to determine that it shouldn't check
read-write-specific data structures, or should that be a different flag?
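
For concreteness, here is the bit I'm asking about, plus a minimal
illustration of testing it from userspace. The 0x1000 value and the
superblock offsets below are taken from the ext4 disk-layout
documentation; the helper itself is purely illustrative (it's not
anything that exists in e2fsprogs), and it assumes a little-endian host
for brevity.

/*
 * Illustrative only: read the ext4 superblock (at byte offset 1024 into
 * the image) and test the RO_COMPAT_READONLY bit in s_feature_ro_compat
 * (offset 0x64 within the superblock). On-disk fields are little-endian.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define EXT4_SB_OFFSET            1024
#define EXT4_S_FEATURE_RO_COMPAT  0x64
#define EXT4_RO_COMPAT_READONLY   0x1000

static int image_marked_readonly(FILE *img)
{
	unsigned char sb[1024];
	uint32_t ro_compat;

	if (fseek(img, EXT4_SB_OFFSET, SEEK_SET) != 0 ||
	    fread(sb, sizeof(sb), 1, img) != 1)
		return -1;
	memcpy(&ro_compat, sb + EXT4_S_FEATURE_RO_COMPAT, sizeof(ro_compat));
	return (ro_compat & EXT4_RO_COMPAT_READONLY) != 0;
}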

> > The short version is that I needed a library to rapidly turn
> > dynamically-obtained data into a set of disk blocks to be served
> > on-the-fly as a software-defined disk, and then mounted on the other
> > side of that interface by the Linux kernel. Turns out that's *many
> > orders of magnitude* faster than any kind of network filesystem like
> > NFS. It's slightly similar to a vvfat for ext4. The fewer blocks it
> > has to generate and account for and cache, the faster it can run, and
> > microseconds matter.
>
> So are you actually trying to dedup data blocks, or are you just
> trying to avoid needing to track the block allocation bitmaps?

Both. I do actually dedup some of the data blocks, when it's easy to do
so. The less data that needs to get generated and go over the wire, the
better.

> And are you just writing a single file, or multiple files? Do you
> know what the maximum size of the file or files will be? Do you need
> a complex directory structure, or just a single root directory?

It's an arbitrary filesystem hierarchy, including directories, files of
various sizes (hence using inline_data), and permissions. The problem
isn't to get data from point A to point B; the problem is (in part) to
turn a representation of a filesystem into an actual mounted filesystem
as efficiently as possible, live-serving individual blocks on demand
rather than generating the whole image in advance.

> Can the file system be sparse?

Files almost certainly won't be (in practice, sparse files would be rare
enough that it's not worth the time and logic to check for them). The
block device is, in a sense: unused blocks aren't transmitted.

> So for example, you can do something like this, which puts all of the
> metadata at beginning of the file system, and then you could write to
> contiguous data blocks. Add the following in mke2fs.conf:

(Note that the main reason mke2fs wouldn't help is that it generates
filesystems in advance, rather than on the fly.)

> [fs_types]
>     hugefile = {
>         features = extent,huge_file,bigalloc,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2

Other than uninit_bg, that's pretty much the list of feature flags I'm setting.
I'm also enabling 64bit, inline_data, largedir, and filetype.

>         cluster_size = 32768

Many files are potentially small, so 4k blocks make sense. (I've even
considered 1k blocks, but that seemed like a less-well-trodden path and
I didn't want to be less efficient for larger files. 4k seems to be a
good balance, and has the advantage of matching the page size.)
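
Purely as an illustration, and not literally what my generator emits
(it doesn't go through mke2fs at all), the combination I'm describing
would look roughly like this as an mke2fs.conf usage type; the
"ro_image" name is made up, and I've dropped uninit_bg and the
bigalloc/32k-cluster part in favor of plain 4k blocks:

[fs_types]
    ro_image = {
        features = extent,huge_file,flex_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2,64bit,inline_data,largedir,filetype
        blocksize = 4096
    }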

> So if you had come to the ext4 list with a set of requirements, it
> could have been that we could have come up with something which uses
> the existing file system features, or come up with something which
> would have been more specific --- and more importantly, we'd know what
> the semantics were of various on-disk file system formats that people
> are depending upon.

I've mentioned specific aspects of what I'm doing at least twice: once
in the thread about 128-bit inodes and inline data (which got a good
response, and ended up convincing me not to go that route even though
it'd be theoretically possible), and once in the feature request asking
for support in e2fsck for shared bitmap blocks (which didn't get any
further follow-ups after I'd provided additional information).

> > If at some point I'm looking to make ext4 support more than it already
> > does (e.g. a way to omit bitmaps entirely, or a way to express
> > contiguous files with smaller extent maps, or other enhancements for
> > read-only filesystems),
>
> See above for a way to significantly reduce the number of bitmaps.
> Adding a way to omit bitmaps entirely would require an INCOMPAT flag,
> so it might not be worth it.

I agree that it'd be of limited value, compared to the amount of change
required.

> The way to express contiguous files with smaller extent maps would be
> to extend the kernel to allow file systems with block_size > page_size
> read-only. This would allow you to create a file system with a block
> size of 64k, which will reduce the size of the extent maps by a factor
> of 16, and it wouldn't be all that hard to teach ext4 to support these
> file systems. (The reason why it would be hard for us to support file
> systems with block sizes > page size is dealing with page cache when
> writing files while allocating blocks, especially when doing random
> writes into a sparse file. Read-only would be much easier to
> support.)

That seems like it would work well for a filesystem that mostly
contained larger files, but not a mix of small files and occasional
large files. (Imagine, for one possible instance, a normal Linux
system's root filesystem with everything on it, plus a single 200GB
file.) It's more that I'd like to avoid generating and serving (and
having the kernel parse and recurse through) a full set of extent tree
blocks; it would be nice if an extent record could represent more than
32k blocks. It'd be even nicer if *inline* extents could represent an
arbitrarily long file as long as the file was contiguous.

Would it be reasonable, particularly on a read-only filesystem (to avoid
the possibility of having to break it into a massive extent tree on the
fly), to add a way to represent a completely contiguous or mostly
contiguous file using only the 60-byte in-inode region, and no auxiliary
extents? Having extents with a 48-bit number of blocks would be helpful,
and in theory a contiguous file doesn't exactly need ee_block.
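
For reference, here's the current on-disk extent format as I understand
it, paraphrased with plain fixed-width types (the real structures in
fs/ext4/ext4_extents.h use __le16/__le32, since the on-disk fields are
little-endian):

#include <stdint.h>

struct ext4_extent_header {
	uint16_t eh_magic;       /* 0xF30A */
	uint16_t eh_entries;     /* number of valid entries that follow */
	uint16_t eh_max;         /* capacity of this node */
	uint16_t eh_depth;       /* 0 means entries point at data blocks */
	uint32_t eh_generation;
};                               /* 12 bytes */

struct ext4_extent {
	uint32_t ee_block;       /* first logical block covered */
	uint16_t ee_len;         /* up to 32768 blocks; larger values
	                            encode unwritten extents */
	uint16_t ee_start_hi;    /* high 16 bits of physical block */
	uint32_t ee_start_lo;    /* low 32 bits of physical block */
};                               /* 12 bytes */

The 60-byte i_block area is one 12-byte header plus at most four
12-byte extents, so with 4k blocks a depth-0 tree tops out at four
extents of 128MiB each; a wider length field (or an implicit ee_block
for fully contiguous files) is what I'm gesturing at above.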

> So please, talk to us, and *tell* us what it is you're trying to do
> before you try to do it. Don't rely on some implementation detail
> where we're not being sufficiently strict in checking for an invalid
> file system, especially without telling us in advance and then trying
> to hold us to the lack of checking forever because it's "breaking
> things that used to work".

I'd be happy to start more such conversations in the future. And as
mentioned above, I am *not* looking for that level of guarantee of never
checking *anything* that used to be checked; I'm looking to
collaboratively figure out what the best solution is. I don't want the
default assumption to be that the kernel needs to remain perfectly
accepting of everything it did in the past, nor do I want the default
assumption to be that userspace should always change to meet the new
kernel requirements.

- Josh Triplett