Re: [reiserfs-dev] Re: Ext2 directory index: ALS paper and benchmarks

From: Daniel Phillips (
Date: Sun Dec 09 2001 - 15:13:34 EST

On December 9, 2001 05:19 am, Linus Torvalds wrote:
> On Sun, 9 Dec 2001, Daniel Phillips wrote:
> >
> > The difference is, there's only one superblock per mount. There are
> > bazillions of inodes.
> .. and they share buffers.

Not consistently, though this is something that could and should be improved.
Try "ls -i | sort" to see what I mean.

> I bet that having a pointer to the buffer will _reduce_ average
> average footprint rather than increase it. The inodes are fairly densely
> laid out in the buffers, so I dare you to find any real-world (or even
> very contrieved) usage patterns where it ends up being problematic.

I didn't have to look any further than my laptop. Home and etc look like
near-worst cases (because of slow growth and doubtful inode allocation
policy); /bin looks pretty close to the best case, but still it's very rare
to find a full 32 inodes densely allocated. The tradeoff is this: for each
itable block member that happens to be in cache simultaneously you can save
68 bytes by pinning the itable block buffer and pointing directly at it,
while incurring the cache cost of pinning an additional 60 bytes that are
never accessed outside read/update_inode. For each itable block member that
isn't in cache (perhaps because it wasn't allocated) you pin an extra 128
bytes. The break-even is 6 uncached inodes/block - any more and you're
pinning more cache than you're saving.

The case that hurts is where inode access order is non-coherent, which is the
case we're trying to fix. If inode access order *is* coherent, you don't
have a problem because once you're done with an entire itable block it can be
evicted and not read back any time soon.

You're also touching an extra cacheline/inode by following the pointer, for
what it's worth [not much, because we're doing get_block/truncate when
following the link, making the cacheline hit look microscopic].

Continuing the little warts list, there's Alan's comment re needing endian
reversal on big endian machines. I'm also looking at the locking
suspiciously. Directly operating on the buffer it looks like we can happily
change the data even in the middle of a dma. This is murky stuff, I'd be
happy to be corrected on it. I wonder what Ext3 impact we'd have here?

> Remember: we'd save 15*4=60 bytes per inode, at the cost of pinning the
> block the inode is in. But _usually_ we'd have those blocks in memory
> anyway, especially if the inode gets touched (which dirties it, and
> updates atime, which forces us to do writeback).

So by always pinning the itable block you'd be eroding the benefit that can
be obtained through inhibiting atime updating. Even with atime, at least
it's only the in-memory inode being dirtied. When it comes time to update,
itable blocks can be read in as necessary, updated and evicted as the icache
continues to exert pressure. Caveat: I'm rapidly wandering off into obscure
cache balancing issues here, little understood and probably poorly handled.

> For the initial IO we obviously _have_ to have them in memory.
> And we'll get rid of inodes under memory pressure too, so the pinning will
> go away when memory is really tight. Yes, you can try to "attack" the
> machine by trying to open all the right inodes, but the basic attack is
> there already. ulimit is your friend.

Unfortunately, it's not an attack, it's a not-so-rare real life situation -
not to the extreme of my example, but close enough to hurt.

> > It's worth keeping in mind that tweaking the icache efficiency in this
> > case is really just curing a symptom - the underlying problem is a
> > mismatch between readdir order and inode order.
> Well, the inode writeback read-modify-write synchronization and related
> efficiency problems actually has nothing to do with the readdir order.

I've measured the effect. Ignoring seeks, the worst case access order is:

    hit_inum(i%n*m + i/n)

where n is the number of itable blocks and m is inodes/block, i.e., for each
itable block, access a single inode on it, repeat until all inodes have been
accessed. Here, pinning inode blocks is going to multiply the number of
inode evictions, even if we set noatime. Cases that approach this degree of
non-coherence aren't rare at all.

The bottom line is, I think you can find cases where your strategy for
reducing double caching saves a little, whereas I can find case where it
hurts a lot.

What if we could do inode-granular transfers through an inode address_space?
This will eliminate the double caching entirely, because we're always able to
generate the full on-disk inode from the cached inode. Then it costs little
to evict the inode buffer immediately after every transfer. It seems to me
that the cost of this would be just a table of states per address_space unit,
plus some fiddling with bio, which we're doing anyway.


- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to More majordomo info at Please read the FAQ at

This archive was generated by hypermail 2b29 : Sat Dec 15 2001 - 21:00:15 EST