Re: [PATCH] Clustering indirect blocks in Ext3

From: Theodore Tso
Date: Fri Nov 16 2007 - 21:59:22 EST


On Fri, Nov 16, 2007 at 04:25:38PM -0800, Abhishek Rai wrote:
> Ideally, this is how things should be done, but I feel in practice, it
> will make little difference. To summarize, the difference between
> my approach and above approach is that when out of free blocks in a
> block group while allocating indirect block, the above approach repeats
> the same allocation algorithm in the next block group, while I fully
> fall back to old-style allocation meaning the indirect block gets
> co-located with the data block in the next block group with a free
> block.

Well, I also suggested that if the metacluster region is full, the
allocator should attempt to find a block starting at the end of the
metacluster region and then wrap around, instead of starting at the
beginning of the block group. That way it's more likely that a
subsequent metadata block is nearer to the previous metadata blocks.

> In practice, this will make a difference only for one indirect
> block as from next request onwards the goal will be updated to the new
> group making the behavior like what you propose. Still, I think your
> suggestion is cleaner and I'll change to that.

The practice of starting the search in the next block group in the
metadata area only makes a difference for one indirect block, yes, but
it's the right thing to do. And if you fold ext3_new_blocks() and
ext3_new_indirect_blocks() together, it's really not that hard. You
can basically do something like this:

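	/* Each hex digit names a region; the lowest digit is tried first. */
	if (alloc_for_metadata)
		strategy = 0x132;	/* metacluster, after it, then before it */
	else
		strategy = 0x231;	/* before metacluster, after it, then inside it */
	for (; strategy; strategy = strategy >> 4) {
		switch (strategy & 0xF) {
		case 1:		/* start of group up to the metacluster */
			start = block_group_start;
			end = mc_start - 1;
			break;
		case 2:		/* the metacluster itself */
			start = mc_start;
			end = mc_end;
			break;
		case 3:		/* after the metacluster to end of group */
			start = mc_end + 1;
			end = block_group_end;
			break;
		}
		<search region between start..end>
	}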
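
To make the search order concrete, here is a rough standalone sketch of
the loop above. It is not kernel code: struct bg_layout, struct region
and search_order() are made-up names for illustration, and the ranges
are half-open [start, end) so that a metacluster sitting at the very
start of the group does not underflow an unsigned block number.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one block group (illustrative names only). */
struct bg_layout {
	unsigned long group_start;	/* first block of the group */
	unsigned long mc_start;		/* first block of the metacluster */
	unsigned long mc_end;		/* last block of the metacluster */
	unsigned long group_end;	/* last block of the group */
};

/* Half-open range [start, end); empty when start == end. */
struct region {
	unsigned long start, end;
};

/*
 * Fill out[] with the regions to search, in order, and return how many.
 * Each hex digit of the strategy names one region; the lowest digit is
 * tried first and the value is shifted one digit (4 bits) per pass.
 */
static size_t search_order(const struct bg_layout *bg, int for_metadata,
			   struct region out[3])
{
	unsigned int strategy = for_metadata ? 0x132 : 0x231;
	size_t n = 0;

	for (; strategy; strategy >>= 4) {
		struct region r;

		switch (strategy & 0xF) {
		case 1:	/* start of group up to the metacluster */
			r.start = bg->group_start;
			r.end = bg->mc_start;
			break;
		case 2:	/* the metacluster itself */
			r.start = bg->mc_start;
			r.end = bg->mc_end + 1;
			break;
		case 3:	/* after the metacluster to end of group */
			r.start = bg->mc_end + 1;
			r.end = bg->group_end + 1;
			break;
		default:
			continue;
		}
		assert(n < 3);
		out[n++] = r;
	}
	return n;
}
```

For a group covering blocks 0..999 with the metacluster at 100..199, a
metadata request would search [100,200), then [200,1000), then wrap to
[0,100); a data request would search [0,100), then [200,1000), and only
then dip into [100,200).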

> We initially avoided making metaclustering a superblock tunable as we
> didn't want to make any changes to the on-disk format as then ext4
> extents are also a good option.

Allocating a superblock field is no big deal. I'll note further that
metaclustering is not necessarily mutually exclusive with ext4
extents. Allocating the extent tree data blocks out of the
metacluster blocks can be a good idea, depending on the average size
of the blocks and how fragmented the filesystem gets (and hence how
many contiguous extents can be expected). If the filesystem is
storing lots of really big files where being contiguous across
multiple block groups is productive, then the metacluster area would
actually be counterproductive. And if files are all small so the
extents fit in the inode, the metadata cluster area wouldn't be necessary
at all. But if there are multiple external extent blocks in a block
group, it would be useful for them to be allocated together.

> If metaclustering gains acceptance
> it might make sense to make it a superblock tunable. However, I would
> avoid putting metacluster size into the superblock for the following
> reason. Ideally, we should not have to bother about finding the sweet
> spot of metacluster size as
> (1) a given file system can be used for storing different kinds
> of files at different times and it would be a pain to tune it every now
> and then, and

Yes, it doesn't make sense to retune the filesystem. I was assuming
that this would only be done at mke2fs time.

> (2) it opens the possibility of doubting metacluster size for unrelated
> ext3/fsck performance anomalies.

I'm not sure I understand your concern. The reality is that 99% of
the time users will never change it from the defaults, but making it
tunable makes it much, much easier for us to try various experiments
to determine what is the best initial value for different workloads.
What might get used for a Usenet news spool or a Squid cache might be
quite different from a series of DVD image files.

> Allow me to propose a solution that will most likely address the above
> issue and please ignore its complexity for a moment. Instead of a two
> level partitioning in the block space between data blocks and
> metacluster blocks, have a 3 or 4 level partitioning. E.g., a block
> group with 'd' blocks can have d/32 blocks in metacluster level 1,
> d/64 blocks in metacluster level 2, and d/128 blocks in metacluster
> level 3 (define level 0 as having the remaining blocks = d - d/32 -
> d/64 - d/128). Data block allocation starts looking for a free block
> starting from the lowest possible level. If it is unable to find any
> free blocks at that level in all block groups, it moves up a level and
> so on. Indirect block allocation proceeds in the opposite direction
> starting from higher levels. This approach has several benefits:

That is clever. Oh, one other thing. You didn't mention what
happened when the metacluster field was placed at the end of the block
group. I assume you tried that in your experiments; what were the
results? The obvious thing to do to avoid further fragmentation of
the block group would be to put level 1 at the end of the block group,
level 2 just before it, and level 3 before that, and then allocate the
data blocks starting at the beginning of the block group, i.e.:

+----------------------------------+---------------+---------+-------+
|               data               |    level 3    | level 2 | lvl 1 |
+----------------------------------+---------------+---------+-------+
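
As a back-of-the-envelope sketch of that layout, using the d/32, d/64
and d/128 split proposed in the quoted text: the code below computes
where each level would start inside a group of d blocks, with level 1
at the end and data at the beginning. The names (MC_LEVELS, struct
level_range, layout_levels()) are invented for illustration, and the
fixed fractions are just the ones from the proposal, not tuned values.

```c
#include <assert.h>

#define MC_LEVELS 4	/* level 0 is data; levels 1..3 are metacluster levels */

struct level_range {
	unsigned long start;	/* first block offset within the group */
	unsigned long count;	/* number of blocks in this level */
};

/*
 * Split a group of d blocks: level 1 gets d/32, level 2 gets d/64,
 * level 3 gets d/128, and level 0 (data) gets whatever is left.
 * Level 1 is placed at the end of the group, level 2 just before it,
 * level 3 before that, and data at the beginning.
 */
static void layout_levels(unsigned long d, struct level_range lv[MC_LEVELS])
{
	unsigned long end = d;
	int i;

	assert(d >= 128);
	for (i = 1; i < MC_LEVELS; i++) {
		lv[i].count = d >> (4 + i);	/* d/32, d/64, d/128 */
		end -= lv[i].count;
		lv[i].start = end;
	}
	lv[0].start = 0;
	lv[0].count = end;
}
```

For a group of 32768 blocks this gives data at offsets 0..30975 (30976
blocks), level 3 at 30976 (256 blocks), level 2 at 31232 (512 blocks)
and level 1 at 31744 (1024 blocks).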


> In traditional metaclustering, once we run out of metacluster blocks
> or data blocks, all bets are off. This forces us to keep small
> metaclusters in order to avoid this situation altogether. But with small
> metaclusters, we cannot optimize indirect block allocation on file
> systems with many small files (>48KB). There is only one glitch in
> implementing this. If a block group doesn't have any free blocks at a
> given level, we should be able to find that out quickly instead of
> having to scan its entire bitmap. gdp->bg_free_blocks_count is not good
> enough for this.

Ideally, true, but this was a defect with the original metacluster
scheme as well. We could steal some bits in the block group
descriptor structure to indicate whether a particular level is full.
That would be another on-disk format change that would require
e2fsprogs support, though.
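
A minimal sketch of what such "level is full" bits might look like,
combined with the allocation order from the quoted proposal (data
tries levels from 0 upward, indirect blocks from the highest level
downward, moving on only when a level is full in every group). All of
it is hypothetical: struct group_hint is an in-memory stand-in rather
than the real descriptor, and mark_level(), level_is_full() and
pick_group_level() are invented names.

```c
#include <assert.h>
#include <stdint.h>

#define MC_LEVELS 4	/* level 0 is data; levels 1..3 are metacluster levels */

/* Hypothetical per-group hint: bit n set means level n has no free blocks. */
struct group_hint {
	uint8_t level_full;
};

static void mark_level(struct group_hint *h, int level, int full)
{
	assert(level >= 0 && level < MC_LEVELS);
	if (full)
		h->level_full |= (uint8_t)(1u << level);
	else
		h->level_full &= (uint8_t)~(1u << level);
}

static int level_is_full(const struct group_hint *h, int level)
{
	assert(level >= 0 && level < MC_LEVELS);
	return (h->level_full >> level) & 1;
}

/*
 * Pick a group and level without scanning any bitmap. Data requests try
 * levels 0, 1, 2, 3; indirect-block requests try 3, 2, 1, 0. Within a
 * level, groups are tried starting at goal_group and wrapping around.
 * Returns 0 and fills *group and *level, or -1 if every level is marked
 * full in every group.
 */
static int pick_group_level(const struct group_hint *hints, int ngroups,
			    int goal_group, int for_metadata,
			    int *group, int *level)
{
	int i, j;

	assert(ngroups > 0 && goal_group >= 0 && goal_group < ngroups);
	for (i = 0; i < MC_LEVELS; i++) {
		int lvl = for_metadata ? MC_LEVELS - 1 - i : i;

		for (j = 0; j < ngroups; j++) {
			int g = (goal_group + j) % ngroups;

			if (!level_is_full(&hints[g], lvl)) {
				*group = g;
				*level = lvl;
				return 0;
			}
		}
	}
	return -1;
}
```

The bits are only hints about which bitmaps are worth scanning; the
allocator would still have to set them when a level fills up and clear
them when blocks in that level are freed.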

Regards,

- Ted