Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

From: Andreas Dilger
Date: Mon Mar 20 2006 - 01:34:48 EST


On Mar 18, 2006 12:07 -0500, Theodore Ts'o wrote:
> What I'm trying to simplify is the overhead of users trying to
> understand a tangled mess of features, some compat, some incompat,
> etc.

I think the real goal is that 99% of users never really see this
in the first place. Add in support for these features now with the
most permissive COMPAT flag possible, and chances are that most
normal users won't use these features for a few years. Most of
them are for the 1% of systems that are pushing the current filesystem
limits, and the majority of these users are also more sophisticated.
You won't have Joe Average dual-booting their Linux system with a
16TB filesystem.

> > I think i_blocks can be considered an ROCOMPAT feature, and the large
> > inode reservation for usecond timestamps could be COMPAT I think (since
> > an unsupporting kernel would still update all the timestamps consistently
> > even if the useconds on disk would be some constant instead of 0.

NB - my reference for i_blocks was the use of i_frag|i_fsize for use
by files > 2TB, not the recent proposal for i_blocks in fs blocksize.

> i_blocks can ROCOMPAT only if it is acceptable for stat(2) to return
> erroneous i_blocks return values. I'm not entirely convinced that's a
> good thing, and at the very least it would be extremely confusing, but
> maybe.

I think most cases where someone is ro-mounting their filesystem is
when they need emergency access to the filesystem with an older kernel.
Allowing that access IMHO is more important than exact correctness.
I'm also not aware of any tools that depend on i_blocks being correct,
though I suspect e.g. "cp --sparse" will use a discrepency in
i_size vs (i_blocks >> 9) to determine sparseness.

> usecond timestamps must be at least ROCOMPAT, because of the
> requirement that all newly created inodes must reserve extra space and
> guarantee that i_extra_isize must be at least n bytes (where n is the
> size of the guaranteed extra inode fields). If you don't do that,
> then when the filesystem is mounted one again on a kernel that does
> understand usec timestamps, some inodes will have room for the usec
> time fields, and other inodes won't (because they have too much of the
> space used for EA's), and that will cause serious problems for make(1).

What happens to existing filesystems with large inodes that don't have
enough space for the extra timestamps in the first place? Also, if files
are created while the filesystem is mounted without usecond timestamps
they would get no usecond fields anyways. I agree that there are some
unlikely corner conditions that might be hit (large inode filesystem, on
older kernel without usec support, fills both the in-inode and external
block so much that there isn't 12 bytes left for the usecond timestamps,
and that file happens to depend on the exact accuracy of the timestamp).
IMHO the inconvenience of the ROCOMPAT outweighs the benefits.

We have previously restricted ROCOMPAT and INCOMPAT flags for changes
that would cause corruption or crashes on older kernels. In the end,
I'm not dead-set against making it ROCOMPAT, just trying to maintain
the maximal compatibility possible.

> for cases where we find that increasing the blocksize to
> 8k or even larger (32k? 64k) really helps, but we don't want to pay
> the internal fragmentation penalty for small files. There are other
> ways to solve the problem, yes, such as by assuming that we can use a
> different filesystem for database or video streams separate from the
> /, /usr, and/or /var filesystems, for example.
>
> If we are ready to forever forswear wanting to use large block sizes,
> then maybe we don't need to worry about fragmentations support (or
> maybe the 1.8" pedabyte disk drives will show up and be cheap enough
> that we just won't care about wasting space on small files). But
> that's I think a decision which we need to formally make.

I'm not against large block support (in fact I was hoping this would be
standard by now), or fragment support. Rather, I think that nobody
cares enough about it to actually implement it, and given the growth
of disks the demand will never materialize, just like ext2 filesystem
compression missed the window when the benefits outweighed the costs.

If and when large-page/block support makes it into a commodity CPU
(sadly, x86_64 missed the mark) or kernel we can re-evaluate it then.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/