Re: Filesize limitation

Kristian Koehntopp (kris@koehntopp.de)
Wed, 5 Nov 1997 18:45:04 +0100 (CET)


In netuse.lists.linux-kernel you write:
>How difficult would it be to upgrade this to a ext3 filesystem with
>32 bits reserved for gid and uid, and 64 bits for length and block count?:

It is possible, but it would make the on-disk inode slightly
larger than 128 bytes, so its size would have to be rounded up
to the next convenient value, 256 bytes, leaving about 100
unused bytes per inode. These bytes could be put to other uses
(more direct blocks, for example), which would be a very
traditional and evolutionary approach to the problem of large
files. Such an ext3 would relate to ext2 as xiafs relates to
minixfs: it simply extends the existing structures, but does
nothing to improve the infrastructure.
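
To make the trade-off concrete, here is a rough sketch of what
such a widened on-disk inode could look like. This is plain C
with <stdint.h> types and made-up field names, not actual ext2
or kernel code, and it omits several fields the real inode has:

    #include <stdint.h>

    /* Hypothetical widened on-disk inode.  ext2's inode is 128
     * bytes; 32 bit uid/gid and 64 bit size/block count push it
     * past that, so the on-disk record would be rounded up to
     * 256 bytes, and the spare bytes could carry, for example,
     * additional direct block pointers. */
    struct ext3ish_inode {
            uint16_t i_mode;          /* file type and permissions */
            uint32_t i_uid, i_gid;    /* widened to 32 bits */
            uint64_t i_size;          /* file length, 64 bits */
            uint64_t i_blocks;        /* block count, 64 bits */
            uint32_t i_atime, i_ctime, i_mtime, i_dtime;
            uint16_t i_links_count;
            uint32_t i_flags;
            uint32_t i_block[15];     /* 12 direct + 3 indirect */
    };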

Or one could go the whole way and completely change the way the
file system organizes blocks on disk. Thinking this through to
the end, with large files, large numbers of files and large
directories in mind, brings us straight into a wonderful world
of extents, trees and multiple locks per file. See NTFS and the
code of von Loewis, or reiserfs, for examples of tree-based
file systems. Or have a look at Silicon Graphics' XFS, which
does a whole lot for handling files in the TB range (or a
TB-range number of small files - you choose) (*1).
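
In case the term is unfamiliar: an extent-based filesystem does
not record every single block a file occupies, it records runs
of contiguous blocks. A minimal sketch of the difference (the
names and types are mine, not taken from any of the filesystems
above):

    #include <stdint.h>

    /* Block-mapped, ext2-style: one pointer per block, plus
     * indirect blocks for anything beyond the direct pointers. */
    struct block_map {
            uint32_t direct[12];
            uint32_t indirect, double_indirect, triple_indirect;
    };

    /* Extent-mapped: one entry per contiguous run of blocks.  A
     * large file written contiguously needs a handful of extents
     * instead of thousands of individual block pointers. */
    struct extent {
            uint64_t logical_start;   /* block offset within the file */
            uint64_t physical_start;  /* starting block on disk */
            uint32_t length;          /* contiguous blocks in this run */
    };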

SGI was even more radical in their changes, if I understood
their paper (*2) correctly: they found that the current
structure of their buffer cache, where cached blocks are tagged
with major:minor:block# keys, did not suit them well. Their
problem was the use of block numbers to index the cache,
because that forced early layout decisions on file writes.

For example, when a file is being extended, a new block has to
be allocated for that file in the cache, and the cache page is
tagged with an absolute block number within that partition -
the system has to decide where to place the new block on disk
as soon as the block is allocated. At this point the block has
not yet been written to disk, nor is it known whether and how
many blocks will follow. For the common case where a single
file is written from start to end and is comparatively small,
the filesystem is thus forced to make its layout decisions
sequentially and immediately, where they could be deferred and
made for the whole file, or at least for larger portions of it.
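
Put as a data structure (the names are mine, not the actual
Linux or IRIX buffer cache), the conventional cache key looks
like this, and the trouble is that the last field has to be
filled in before the block can even enter the cache:

    #include <stdint.h>

    /* Conventional buffer cache key: a cached block is identified
     * by its device and its absolute block number on that device.
     * For a freshly allocated file block this means the filesystem
     * must decide where on disk the block will live before a single
     * byte of it has been written. */
    struct buffer_key {
            uint16_t b_dev_major;
            uint16_t b_dev_minor;
            uint64_t b_blocknr;   /* absolute block number on the partition */
    };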

If the cache pages were tagged with maj:min:ino#:rel_block#
instead (or in addition), pages could be placed into the buffer
cache without having to determine their physical position on
disk yet. When the cache is flushed, its contents could be
sorted by ino# and then by rel_block#. It would then be
relatively easy to determine runs (start_block#, length) of
blocks belonging to the same file, which could be placed into
appropriately sized extents of free blocks on disk.
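
A sketch of that idea, with made-up names and no claim to match
what XFS actually does: tag dirty pages with inode number and
file-relative block number, and only at flush time sort them
and coalesce adjacent blocks of the same file into extents that
can be handed to the allocator in one piece:

    #include <stdint.h>
    #include <stdlib.h>

    /* Cache key that defers layout: the page is identified by the
     * file it belongs to and its position within that file, not by
     * a disk address. */
    struct page_key {
            uint16_t dev_major, dev_minor;
            uint32_t ino;             /* inode number */
            uint64_t rel_block;       /* block number relative to file start */
    };

    static int cmp_key(const void *a, const void *b)
    {
            const struct page_key *x = a, *y = b;
            if (x->ino != y->ino)
                    return x->ino < y->ino ? -1 : 1;
            if (x->rel_block != y->rel_block)
                    return x->rel_block < y->rel_block ? -1 : 1;
            return 0;
    }

    /* At flush time: sort the dirty pages by (ino#, rel_block#) and
     * walk the result, starting a new extent whenever the inode
     * changes or the relative block number is not the successor of
     * the previous one.  Each run can then be placed into a single
     * free extent on disk. */
    void flush_plan(struct page_key *dirty, size_t n)
    {
            size_t i = 0, j;

            qsort(dirty, n, sizeof(*dirty), cmp_key);
            while (i < n) {
                    j = i + 1;
                    while (j < n && dirty[j].ino == dirty[i].ino &&
                           dirty[j].rel_block == dirty[j-1].rel_block + 1)
                            j++;
                    /* allocate one on-disk extent of (j - i) blocks
                     * for dirty[i..j-1] and write them out here */
                    i = j;
            }
    }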

This leaves us with the problem of finding appropriately sized
extents of free space, which is difficult to solve with block
bitmaps: find the smallest extent of free blocks that is large
enough to hold the required number of blocks. If there are
multiple extents of that size, choose the nearest one (if there
is no extent large enough, break the allocation up into smaller
extents and place them individually - that is, fragment your
file). With bitmaps this is a complicated and slow search,
counting adjacent runs of free bits. But since in XFS all data
structures are trees, free space is organized in a tree of
extents, indexed by extent size and by extent starting block.
Finding the required extent is easy with such a data structure.
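
As a toy illustration: once the free space is available as a
set of (start, length) extents - which is what the tree hands
you directly, and what a bitmap forces you to reconstruct by
scanning for runs of free bits - the best-fit search itself is
simple. A naive version over a plain array, with names of my
own invention, not XFS code:

    #include <stdint.h>
    #include <stddef.h>

    struct free_extent {
            uint64_t start;           /* first free block of the run */
            uint64_t length;          /* number of contiguous free blocks */
    };

    /* Best fit: the smallest free extent that still holds `needed'
     * blocks; among equally sized candidates, the one closest to
     * `goal' wins.  Returns NULL if no single extent is big enough,
     * in which case the file has to be fragmented. */
    struct free_extent *find_extent(struct free_extent *fr, size_t n,
                                    uint64_t needed, uint64_t goal)
    {
            struct free_extent *best = NULL;
            uint64_t best_dist = 0;
            size_t i;

            for (i = 0; i < n; i++) {
                    uint64_t d;

                    if (fr[i].length < needed)
                            continue;
                    d = fr[i].start > goal ? fr[i].start - goal
                                           : goal - fr[i].start;
                    if (best == NULL || fr[i].length < best->length ||
                        (fr[i].length == best->length && d < best_dist)) {
                            best = &fr[i];
                            best_dist = d;
                    }
            }
            return best;
    }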

Sorry for boring you all to death with theoretical ideas
instead of proven code - I just wanted to illustrate that a
proper solution to the very-large-file problem is more complex
than just widening a few fields in an on-disk structure.

Kristian

(*1) SGI quotes one configuration, based on a Challenge server
     with 24 SCSI controllers driving 24 RAID arrays under a
     single, array-spanning XFS filesystem. This installation
     was able to draw a sustained data stream of some 380 MB/sec
     (megabytes per second, sustained, not peak) from its disks,
     using four concurrent threads written in C and reading via
     regular filesystem system calls. Quite impressive, I think.

(*2) Sweeney, "Scalability in the XFS File System", presented at
the 1996 USENIX tech conference (quoted from memory).