Re: [reiserfs-dev] Re: Ext2 directory index: ALS paper and benchmarks

From: Daniel Phillips (
Date: Sat Dec 08 2001 - 22:27:29 EST

On December 8, 2001 06:54 pm, Jeff Garzik wrote:
> Daniel Phillips wrote:
> > Linus wrote:
> > > So "ext2_write_inode()" would basically become somehting like
> > >
> > > struct ext2_inode *raw_inode = inode->u.ext2_i.i_raw_inode;
> > > struct buffer_head *bh = inode->u.ext2_i.i_raw_bh;
> > >
> > > /* Update the stuff we've brought into the generic part of the
inode */
> > > raw_inode->i_size = cpu_to_le32(inode->i_size);
> > > ...
> > > mark_buffer_dirty(bh);
> > >
> > > with part of the data already in the right place (ie the current
> > > "inode->u.ext2_i.i_data[block]" wouldn't exist, it would just exist as
> > > "raw_inode->i_block[block]" directly in the buffer block.
> note we do this for the superblock already, and it's pretty useful

The difference is, there's only one superblock per mount. There are
of inodes.

> > I'd then be able to write a trivial program that would eat inode+blocksize
> > worth of cache for each cached inode, by opening one file on each itable
> > block.
> you already have X overhead per inode cached... yes this would increase
> X but since there is typically more than one inode per block there is
> also sharing as well. So inode+blocksize is not true.

You skipped over my example too fast.

> > I'd also regret losing the genericity that comes from the read_inode
> > and update_inode (repack) abstraction.
> so what is write_inode... re-repack? :)

It's a trivial shell for update_inode:

> > Right now, I don't see any fields in
> > _info that aren't directly copied, but I expect there soon will be.
> i_data[] is copied, and that would be nice to directly access in
> inode->u.ext2_i.i_bh...

Yes, that's the major one, it's 60 bytes, more than the other _info fields
put together. However, almost half the itable block data is going to be
redundant, and the proposal as I understand it is to lock it in cache while
the inode is in cache. This makes things worse, not better - it reduces the
total number of inodes that can be cached. And that's the best case,
when *all* the inodes on an itable block are in cache. Take a look at the
inode distribution in your directories and see if you think that's likely.

> > An alternative approach: suppose we were to map the itable blocks with
> > smaller-than-blocksize granularity. We could then fall back to smaller
> > transfers under cache pressure, eliminating much thrashing.
> in ibu fs the entire inode table[1] is accessing via the page cache.
> ext2 could do this too. If ext2's per-block-group inode table has
> padding at the end page calculations get a bit more annoying but it's
> still doable.

That's roughly what I had in mind, for starters.

It's worth keeping in mind that tweaking the icache efficiency in this case
is really just curing a symptom - the underlying problem is a mismatch
between readdir order and inode order.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at
Please read the FAQ at

This archive was generated by hypermail 2b29 : Sat Dec 15 2001 - 21:00:14 EST