EXT2 and contiguous files

Matti Aarnio (matti.aarnio@tele.fi)
Fri, 9 Jan 1998 18:42:39 +0000 (/etc/localtime)


Hi,

Here is a writeup about EXT2 modification, which I think will boost
system performance with massive file-format databases. Consider
single files of 1kB blocks, those need triple-level indirection
system on block addressing, once the file size exceeds circa 66 k
blocks (alias 66 MB). lseek() around that file for more than circa
256k, and you need to get new third level indirection pointer block
before you can find the target page. Possibly you need to get
three indirection blocks before you can find the final target.

For those who don't know what the indirection scheme I am referring
to is about, it goes like this:
- The i-node contains 12 32-bit values storing direct block addresses
for the first 12 blocks of the file. Amount of the meta-data needed
to be fetched to reach those blocks is nil.
- The 13th block pointer points to a block containing 32-bit datablock
addresses, which with 1kB block sizes means that block can describe
up to 256 additional datablocks, or (256+12) kB of data.
- The scheme extends by two mode i-node slots adding more layers of
indirection -- mid-blocks pointing to blocks pointing to datablocks.

Remember, access of random block at the disk is EXPENSIVE (timewise).
It may take as little as 5 milliseconds with fast disks, but it may
take as much as 25 milliseconds with not so hot ones..

Now, dive in.

/Matti Aarnio <matti.aarnio@tele.fi> <mea@nic.funet.fi>

Contiguous file block allocation coding for EXT2+
------------------------------------------------

When thinking of the classical file allocation schemes in UNIXes
with a few direct addressing pointers, and then singly-to-triply
indirecting pointer pages, the files with larger sizes get more
and more of meta-data which slows down the file access rather
dramatically. This is especially important detail with database
files.

Having seen alternates, mainly DEC UNIX AdvFS contiguous file
facility in which the metadata contains block ranges; telling
the first addressed block, and the number of blocks in the range.

Such facility might be able to handle discontinuous files, also
known as "holed" or "sparse" files -- lseek() far away, and write
a bit, lseek elsewere, write again. However the plan I chose to
do for Linux EXT2 does not allow discontinueties.
(Mainly to avoid complexities at joining/splitting ranges.)

In one system slightly modified e2fsck (based on version 1.10)
reported following information regarding non-contiguous file
allocations; the utility calls them "fragments", although that
detail is somewhat misnomer, because EXT2 does not have fragments
per se..:

68171 inodes used (16%)
2274 non-contiguous inodes (3.3%), 9058 fragments (304 highest)
# of inodes with ind/dind/tind blocks: 14905/418/0
1237070 blocks used (73%)
0 bad blocks

61917 regular files

Cumulative percentual summs, and instance counts
per dis-contiguouty (+1) counts per file in between
counts of 1-15 (or rather up to 99.9 % of cases..):

96.710 59904 1 99.805 4 9
98.250 954 2 99.824 12 10
98.657 252 3 99.842 11 11
99.131 294 4 99.858 10 12
99.432 186 5 99.881 14 13
99.730 185 6 99.887 4 14
99.777 29 7 99.902 9 15
99.798 13 8 ...

That is, with 5 ranges I can handle up to 99.4% of
my files..

The cumulative percentual summs, and instance
counts of filesizes in bins of 100/1000 blocks:

97.559 60430 100 99.798 32 700
99.001 893 200 99.811 8 800
99.398 246 300 99.834 14 900
99.583 115 400 99.847 8 1000
99.664 50 500 99.947 62 2000
99.747 51 600 ...

The worst fragmented file (304 ranges) had 2587 blocks
in it. On the other hand, largest 10 files had these
range counts -- in between 2 and 20 range entries would
cover these files, largest one would need only two ranges:
8155 2 9447 10
8248 2 9454 6
8283 3 9549 19
8931 12 9582 7
9118 6 9676 2

In every case, with i-node containing range specifications, 96.7% (!)
of those would need SINGLE range entry without any indirection blocks.
The worst case would use 304 ranges. More about these below.

Doing range entries in triplets of:
- First block number
- First block address
- blockcount
To have the range start offset (blocknum) there helps to do binary
search for target block number, because we can't rely on linear summing
from the start of the chain to find the next range.

This means 12 bytes per each range entry. With following block-sizes
we can thus derive:

Triplets Indirected ranges:
BlkSize u32's per block Singly- Doubly-indir
512 128 42 42 64* 42+ 42 = 2730
1024 256 85 85 128* 85+ 85 = 10965
2048 512 170 170 256*170+170 = 43690
4096 1024 314 314 512*314+314 = 161082
8192 2048 682 682 1024*682+682 = 699030

The usual case of blocksize of 1024 bytes thus yielding singly
indirected range block of 85 entries (+ base of 3, see below),
which in the sample analysis did go well into the statistical
tail-end of single instances. The double-indirected scheme goes
way beyond any sane need.

In case of double-indirection, the middle level contains a block
of pairs listing the blocknumber of the first range in the leaf
block, and then the address of the leaf block with ranges.

With following indirection scheme in the i-node i_block[15] array:
0: Number of ranges for this file (+)
1: 1st range's start block address in the dist
2: 1st range's blockcount
3: 2nd range's start block number
4: 2nd range's start block address in the disk
5: 2nd range's blockcount
6: 3rd range's start block number
7: 3rd range's start block address in the disk
8: 3rd range's blockcount
9: 0 4th (*)
10: Top allocated block number 4th (*)
11: Blocknumber of first singly indirected range 4th (*)
12: Address of the first indirector block 5th (*)
13: Blocknumber of first double-indirected range 5th (*)
14: Address of the double-indirector block 5th (*)

(+) In the i-node i_block[] array the first range has implicite start
block number of 0.

(*) With up to 5 ranges, ALL of the range entries are kept in
the i_block[] array, the addition of 6th range causes creation of
the first indirection block, and 4th-5th entries are moved there along
with the new 6th entry, and the primary format is established in the
i_block[] array. While going down, cutting down to 5 or 4 entries
causes the first indirected range block to be read for their first two,
or one, entries, which are copied into i_block[].

Another possible optimization is to use the 13/14 i_block[] slots with
second singly indirected page as long as the total range count does not
exceed 3 + RANGES_IN_BLOCK * 2. When that limit is reached, the double-
indirection block must be added. (But that is for extremely badly dis-
contiguous files!) [ 1kB blocks give limit of 173 range entries. ]

In a valid range-entry block there is always initial u32 with content
of 0, and then start the triplets of range entries. In such setup,
the second u32 is ALWAYS non-zero. (Divide any power of two with 3, and
you will always get 1 or 2 as a reminder, thus we can "waste" the first
u32..)

Can the file be converted from the classical block-addressing scheme
to contiguous range scheme by issueing a suitable ioctl() to it
remains to be seen. I don't plan that at first, anyway.

Neither I plan to support sparse files. Think a moment what happens
to range entries with sparse files, and you will realize why they are
bad idea to support.

/Matti Aarnio <matti.aarnio@tele.fi> <mea@nic.funet.fi>