Re: [00/17] Large Blocksize Support V3

From: Nick Piggin
Date: Thu Apr 26 2007 - 11:20:34 EST

David Chinner wrote:
On Thu, Apr 26, 2007 at 05:48:12PM +1000, Nick Piggin wrote:

Christoph Lameter wrote:

On Thu, 26 Apr 2007, Nick Piggin wrote:

No I don't want to add another fs layer.

Well maybe you could explain what you want. Preferably without redefining the established terms?

Support for larger buffers than page cache pages.

The problem with this approach is that it turns around the whole
way we look at bufferheads. Right now we have well defined 1:n
mapping of page to bufferheads and so we tpyically lock the
page first them iterate all the bufferheads on the page.

Going the other way, we need to support m:n which we means
the buffer has to become the primary interface for the filesystem
to the page cache. i.e. we need to lock the bufferhead first, then
iterate all the pages on it. This is messy because the cache indexes
via pages, not bufferheads. hence a buffer needs to point to all the
pages in it explicitly, and this leads to interesting issues with

If you still think that this is a good idea, I suggest that you

Yeah, I think it possibly is. But I don't think that much of the
vm should need to be touched at all, especially not mmap. And I
think only those filesystems interested in supporting it would
have to implement it (maybe a couple), not all of them.

I didn't run into all the issues you mentioned yet, and I don't
see why it would have to turn into a real (traditional) buffer
cache layer rather than just a Linux (ie. block mapping) buffer

So block size > page cache size... also, you should obviously be using
hardware that is tuned to work well with 4K pages, because surely there
is lots of that around.

The CPU hardware works well with 4k pages, but in general I/O
hardware works more efficiently as the numbers of s/g entries they
require drops for a given I/O size. Given that we limit drivers to
128 s/g entries, we really aren't using I/O hardware to it's full
potential or at it's most efficient by limiting each s/g entry to a
single 4k page.

And FWIW, a having a buffer for block size > page size does not
solve this problem - only contiguous page allocation solves this

But this seems to be due to your final IO request size, rather than
the number of sg entries the card has to process, right? My point
is that MMUs and the whole memory and data transfer system is quite
likely to be optimised for 4K pages, so it would be strange to think
that an IO card could not do that.

Why do we limit drivers to 128 sg entries?

SUSE Labs, Novell Inc.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at