Re: [00/17] Large Blocksize Support V3

From: Nick Piggin
Date: Fri Apr 27 2007 - 06:06:08 EST

Christoph Hellwig wrote:
On Thu, Apr 26, 2007 at 05:48:12PM +1000, Nick Piggin wrote:

Well maybe you could explain what you want. Preferably without redefining the established terms?

Support for larger buffers than page cache pages.

I don't think you really want this :) The whole non-pagecache I/O
path before 2.3 was a total pain just because it used buffers to drive
I/O. Add to that buffers bigger than a page and you add another
two magnitudes of complexity. If you want to see a mess like that,
download one of the early XFS/Linux releases that had an I/O path
like that. I _really_ _really_ don't want to go there.

I'm not actually suggesting to add anything like that. But I think
larger blocks can be doable while retaining the "buffer" layer as a
relatively simple pagecache to block translation.

Anyway, I'm working on patches... they might crash and burn, but we
might have something to talk about later.

Linux has a long tradition of trading a tiny bit of efficiency for
much cleaner code, and I'd 100% go down Christoph's route here.
Then again I'd actually be rather surprised if > page buffers
were more efficient - you'd run into shitloads of overhead due to
them being non-contiguous, like calling vmap all over the place,
reprogramming iommus to at least make them look virtually contiguous [1],

I still think hardware should work reasonably well with 4K pages. The
SGI I/O controllers and/or the Linux block layer not allowing more
than 128 sg entries is clearly suboptimal if the hardware runs twice as
fast with 2MB submissions.
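
To put rough numbers on the 128 sg entry limit, here is a back-of-envelope
sketch. It assumes 4K pages and the worst case where every sg entry covers
exactly one page (physically contiguous pages that could be merged into one
entry are ignored). The function names are made up for illustration and are
not kernel interfaces.

```python
PAGE_SIZE = 4096  # bytes; assumes a 4K base page size


def max_request_bytes(sg_entries, bytes_per_entry=PAGE_SIZE):
    """Largest request when each sg entry covers one segment of the given size."""
    return sg_entries * bytes_per_entry


def entries_needed(request_bytes, bytes_per_entry=PAGE_SIZE):
    """Number of sg entries needed for a request, rounding up."""
    return -(-request_bytes // bytes_per_entry)


if __name__ == "__main__":
    # Request size reachable with 128 single-page entries, in KiB.
    print(max_request_bytes(128) // 1024)
    # Entries needed to submit 2MB with single-page entries.
    print(entries_needed(2 * 1024 * 1024))
```

Under those assumptions 128 entries cap a request at 512K, and a 2MB
submission needs 512 entries, so the gap could in principle be closed by
raising the entry limit rather than by making each entry larger.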

I also don't quite get what your problem with higher order allocations
is. Order 1 allocations are generally just fine, and in fact
thread stacks are >= order 1 on most architectures. And if the pagecache
uses higher order allocations that means we'll finally fix our problems
with them, which we have to do anyway. Workloads continue to grow and
with them the kernel overhead to manage them, while the pagesize for
many architectures is fixed. So we'll have to deal with order 1
and order 2 allocations better just for backing kmalloc and co.
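
For readers not used to the "order" terminology: an order-n allocation is
2^n physically contiguous pages. A minimal sketch of the arithmetic,
assuming a 4K base page (the helper names are hypothetical, not kernel
functions):

```python
PAGE_SIZE = 4096  # bytes; assumes a 4K base page size


def order_bytes(order, page_size=PAGE_SIZE):
    """Size in bytes of an order-n allocation (2**order contiguous pages)."""
    return page_size << order


def order_for(size, page_size=PAGE_SIZE):
    """Smallest order whose allocation is at least `size` bytes."""
    order = 0
    while (page_size << order) < size:
        order += 1
    return order


if __name__ == "__main__":
    for n in range(3):
        print(n, order_bytes(n))
    # Order needed to back a single 64K block with 4K pages.
    print(order_for(64 * 1024))
```

So with 4K pages, order 1 is 8K and order 2 is 16K, while a 64K block
would already need an order 4 allocation.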

The pagecache is much bigger and often sees a lot more activity than these
other things though. Also, the more things you add to higher order
allocations, the more pressure you have.

I like PAGE_SIZE pagecache, because it is reliable and really fast, if
you need to reclaim a page it should be almost O(1).

Or think jumboframes for that matter.

They can actually run into problems if the hardware wants contiguous

I don't know why you think the fragmentation issues are just magically
fixed. It is hard and inefficient to reclaim larger order blocks (even
with lumpy reclaim), and Mel's patches aren't perfect. Actually, last
time I looked, they needed to keep at least 16MB of pages free to be
reasonably effective (or do we just say that people with less than XMB
of memory shouldn't be accessing these filesystems anyway?), and I'm
not sure if they have been tested for long term stability in the
presence of a reasonable amount of higher order allocations.

SUSE Labs, Novell Inc.