Re: [00/17] Large Blocksize Support V3

From: Eric W. Biederman
Date: Thu Apr 26 2007 - 13:51:21 EST

Christoph Hellwig <hch@xxxxxxxxxxxxx> writes:

> On Thu, Apr 26, 2007 at 04:50:06PM +1000, Nick Piggin wrote:
>> Improving the buffer layer would be a good way. Of course, that is
>> a long and difficult task, so nobody wants to do it.
> It's also a stupid idea. We got rid of the buffer layer because it's
> a complete pain in the ass, and now you want to reintroduce one that's
> even more complex, and most likely even slower than the elegant solution?

No. I'm really suggesting improving the translation from BIOs
to the page cache: a set of helper functions.

This patch is suggesting we move to a BSD-like buffer cache, except
that everything is physically mapped.

My most practical suggestion is to have support code so that you can
do all of the locking (that I/O cares about) on the first page of a
page group in the page cache. You don't need larger physical pages to
do that.
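
To make that concrete, here is a rough userspace model of the idea, not
kernel code. The names (model_page, group_head_index, group_trylock,
group_unlock) and the fixed GROUP_ORDER are all made up for
illustration; the only point is that the lock state for a whole group
lives on its first page, while the pages themselves stay ordinary
small pages.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define GROUP_ORDER 2                       /* assumed: 4 pages per group */
#define GROUP_PAGES (1UL << GROUP_ORDER)

struct model_page {
    unsigned long index;  /* offset in the file, in pages */
    bool locked;          /* only consulted on the first page of a group */
};

/* Index of the first page of the group that contains 'index'. */
static unsigned long group_head_index(unsigned long index)
{
    return index & ~(GROUP_PAGES - 1);
}

/* Try to lock the whole group by taking the lock on its first page. */
static bool group_trylock(struct model_page *pages, unsigned long index)
{
    struct model_page *head = &pages[group_head_index(index)];

    if (head->locked)
        return false;
    head->locked = true;
    return true;
}

/* Release the group lock held on the first page. */
static void group_unlock(struct model_page *pages, unsigned long index)
{
    pages[group_head_index(index)].locked = false;
}
```

With GROUP_ORDER set to 2, pages 4 through 7 share the lock stored on
page 4, so locking via page 5 blocks a second attempt via page 6 but
not one via page 1. Real code would need atomic operations; this model
ignores concurrency entirely.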

>> Well, for those architectures (and this would solve your large block
>> size and 16TB pagecache size without any core kernel changes), you
>> can manage 1<<order hardware ptes as a single Linux pte. There is
>> nothing that says you must implement PAGE_SIZE as a single TLB sized
>> page.
> Well, ppc64 can do that. And guess what, it's really painful for a lot
> of workloads. Think of a poor ps3 with 256 from which the broken hypervisor
> already takes a lot away and now every file in the pagecache takes
> 64k, every thread stack takes 64k, etc? It's good to have variable
> sized objects in places where it makes sense, and the pagecache is
> definitely one of them.

Agreed, the page cache is all about variable sized objects known as files!
You don't need to do anything extra. The problem is only with building
I/O requests from what is there.
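
As a sketch of what "building I/O requests from what is there" could
mean, the helper below walks a sorted list of page indices and merges
contiguous runs into larger extents. It is a userspace illustration
under my own assumptions: struct io_extent and build_extents() are
invented names, not an existing kernel interface, and a real helper
would also have to deal with locking and with the on-disk layout.

```c
#include <stddef.h>

/* Hypothetical stand-in for one I/O request over contiguous pages. */
struct io_extent {
    unsigned long start;  /* first page index covered */
    unsigned long count;  /* number of contiguous pages */
};

/*
 * Merge sorted, unique page indices into contiguous extents.
 * Writes at most 'max' extents to 'out' and returns how many were
 * written; pages that would need an extent beyond 'max' are left out.
 */
static size_t build_extents(const unsigned long *idx, size_t n,
                            struct io_extent *out, size_t max)
{
    size_t used = 0;

    for (size_t i = 0; i < n; i++) {
        /* Extend the current extent when this page directly follows it. */
        if (used > 0 &&
            out[used - 1].start + out[used - 1].count == idx[i]) {
            out[used - 1].count++;
            continue;
        }
        if (used == max)
            break;
        out[used].start = idx[i];
        out[used].count = 1;
        used++;
    }
    return used;
}
```

Given page indices 0, 1, 2, 3, 8, 9 and 16, this produces three
extents: four pages starting at 0, two starting at 8, and one at 16.
The pages stay small; only the request gets large.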

Iff we really need the larger physical page size to support the hardware,
then it makes sense to go down a path of larger pages. But we don't.

There is also a more fundamental reason this patch is silly. It assumes
that there is a trivial mapping between filesystems (the primary target
of the page cache) and blocks on disk. Now I admit this is common, but
there is no reason to suppose it is true, and this patch appears to
exacerbate things.
