Re: readahead on directories

From: Phillip Susi
Date: Thu Apr 22 2010 - 15:24:08 EST


On 4/22/2010 1:53 PM, Jamie Lokier wrote:
> Right, but finding those blocks is highly filesystem-dependent which
> is why making it a generic feature would need support in each filesystem.

It already exists; it's called ->get_blocks(). That's how readahead()
figures out which blocks need to be read.

> support FIEMAP on directories should work. We're back to why not do
> it yourself then, as very few programs need directory readahead.

Because there's already a system call to accomplish that exact task; why
reinvent the wheel?
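
For reference, this is roughly how the call is used on a regular file
today. It is only a sketch: the helper name prefetch_file() is made up
for illustration, and whether the same call can be made to work on a
directory fd is exactly the open question in this thread.

```c
#define _GNU_SOURCE          /* readahead() is a Linux-specific call */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Ask the kernel to pull a whole regular file into the page cache.
 * Returns 0 on success, -1 with errno set on failure.
 * prefetch_file() is a hypothetical helper, not an existing API. */
int prefetch_file(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    int rc = fstat(fd, &st);
    if (rc == 0 && st.st_size > 0)
        rc = (int)readahead(fd, 0, (size_t)st.st_size);

    int saved = errno;       /* keep the interesting errno across close() */
    close(fd);
    errno = saved;
    return rc;
}
```

The caller only has to open the file and hand over a range; mapping
that range to disk blocks stays inside the kernel.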

> If you're interested, try finding all the places which could sleep for
> a write() call... Note that POSIX requires a mutex for write; you
> can't easily change that. Reading is easier to make fully async than
> writing.

POSIX doesn't say anything about how write() must be implemented
internally. You can do without mutexes just fine. A good deal of the
current code does use mutexes, but does not have to. If your data is
organized well then the critical sections of code that modify it can be
kept very small, and guarded with either atomic access functions or a
spin lock. A mutex is more convenient since it allows you to have
much larger critical sections and sleep, but we don't really like having
coarse grained locking in the kernel.
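
To make that concrete, here is a user-space sketch using C11 atomics.
It is not kernel code (in the kernel you would reach for things like
atomic_long_add() and spin_lock()), and the struct and function names
are invented for the example. The counter is updated with a single
atomic add, and the only locked region is two stores long.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-file bookkeeping, purely for illustration. */
struct write_stats {
    atomic_size_t bytes_written;  /* updated with an atomic add, no lock */
    atomic_flag   lock;           /* tiny spin lock guarding the pair below */
    size_t        last_offset;
    size_t        last_len;
};

void stats_init(struct write_stats *s)
{
    assert(s != NULL);
    atomic_init(&s->bytes_written, 0);
    atomic_flag_clear(&s->lock);
    s->last_offset = 0;
    s->last_len = 0;
}

void stats_account(struct write_stats *s, size_t offset, size_t len)
{
    assert(s != NULL);

    /* Single-word update: an atomic add is enough. */
    atomic_fetch_add_explicit(&s->bytes_written, len, memory_order_relaxed);

    /* Two fields must change together: spin briefly, never sleep. */
    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
        ;  /* busy-wait; the critical section is only two stores long */
    s->last_offset = offset;
    s->last_len = len;
    atomic_flag_clear_explicit(&s->lock, memory_order_release);
}
```

Nothing in the locked region can sleep, which is what makes a spin
lock acceptable there.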

> Then readahead() isn't async, which was your request... It can block
> waiting for memory and other things when you call it.

It doesn't have to block; it can return -ENOMEM or -EWOULDBLOCK.
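
From the caller's side that could look something like the sketch
below. To be clear, no such non-blocking variant of readahead() exists
today: the try_prefetch_fn hook and the fake_try_prefetch() stand-in
are assumptions made up to show the retry pattern.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical non-blocking prefetch hook: returns 0 on success or a
 * negative errno such as -EWOULDBLOCK or -ENOMEM. */
typedef int (*try_prefetch_fn)(int id);

/* Make one pass over the pending ids. Ids that would have blocked (or
 * hit -ENOMEM) are compacted to the front of the array for a later
 * retry; everything else is dropped. Returns how many ids remain. */
size_t prefetch_pass(int *ids, size_t n, try_prefetch_fn try_prefetch)
{
    assert(ids != NULL || n == 0);

    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        int rc = try_prefetch(ids[i]);
        if (rc == -EWOULDBLOCK || rc == -ENOMEM)
            ids[kept++] = ids[i];   /* retry later instead of sleeping */
    }
    return kept;
}

/* Stand-in used only to exercise the loop: odd ids "would block". */
int fake_try_prefetch(int id)
{
    return (id % 2) ? -EWOULDBLOCK : 0;
}
```

The caller never sleeps inside the call; it just keeps whatever could
not be started and comes back to it on the next pass.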

> Exactly. And making it so it _never_ blocks when called is a ton of
> work, more lines of code (in C anyway), a maintainability nightmare,
> and adds some different bottlenecks you've not thought of. At this
> point I suggest you look up the 2007 discussions about fibrils which
> are quite good: They cover the overheads of setting up state for async
> calls when unnecessary, and the beautiful simplicity of treating stack
> frames as states in their own right.

Sounds like an interesting compromise. I'll look it up.

> No: In that particular case, waiting while the indirect block is
> parsed is advantageous. But suppose the first indirect block is
> located close to the second file's data blocks. Or the second file's
> data blocks are on a different MD backing disk. Or the disk has
> different seeking characteristics (flash, DRBD).

Hrm... true, so knowing this, defrag could lay out the indirect block of
the first file after the first 12 blocks of the second file to maintain
optimal reading. Hrm... I might have to try that.

> I reckon the same applies to your readahead() calls: A queue which you
> make sure is always full enough that threads never block, sorted by
> inode number or better hints where known, with a small number of
> threads calling readahead() for files, and doing whatever is useful
> for directories.

Yes, and ureadahead already orders the calls to readahead() based on
disk block order. Multithreading it leads to the problem with backward
seeks right now, but a tweak to the way defrag lays out the indirect
blocks should fix that. The more I think about it, the better this idea
sounds.
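
The ordering step itself is simple. The sketch below is not
ureadahead's actual code; the struct and function names are made up,
and it assumes the first physical block of each file has already been
looked up somehow (for example via FIEMAP), which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical work item: a file plus the physical block where its
 * data starts. How first_block is obtained is outside this sketch. */
struct prefetch_item {
    const char *path;
    uint64_t    first_block;
};

/* qsort() comparator: ascending physical block, overflow-safe. */
static int by_block(const void *a, const void *b)
{
    const struct prefetch_item *x = a;
    const struct prefetch_item *y = b;
    return (x->first_block > y->first_block) -
           (x->first_block < y->first_block);
}

/* Sort the queue so that issuing readahead() in array order walks the
 * disk front to back instead of seeking back and forth. */
void sort_for_prefetch(struct prefetch_item *items, size_t n)
{
    assert(items != NULL || n == 0);
    if (n > 1)
        qsort(items, n, sizeof(*items), by_block);
}
```

With a single thread, issuing the calls in this order avoids backward
seeks. With several threads pulling from the same sorted queue, the
requests can still reach the disk out of order.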
