Re: [00/17] Large Blocksize Support V3

From: Mel Gorman
Date: Fri Apr 27 2007 - 09:07:37 EST

On (27/04/07 20:05), Nick Piggin didst pronounce:
> Christoph Hellwig wrote:
> >On Thu, Apr 26, 2007 at 05:48:12PM +1000, Nick Piggin wrote:
> >
> >>>Well maybe you could explain what you want. Preferably without
> >>>redefining the established terms?
> >>
> >>Support for larger buffers than page cache pages.
> >
> >
> >I don't think you really want this :) The whole non-pagecache I/O
> >path before 2.3 was a toal pain just because it used buffers to drive
> >I/O. Add to that buffers bigger than a page and you add another
> >two mangnitudes of complexity. If you want to see a mess like that
> >download on of the eary XFS/Linux releases that had an I/O path
> >like that. I _really_ _really_ don't want to go there.
> I'm not actually suggesting to add anything like that. But I think
> larger blocks can be doable while retaining the "buffer" layer as a
> relatively simple pagecache to block translation.
> Anyway, I'm working on patches... they might crash and burn, but we
> might have something to talk about later.
> >Linux has a long tradition of trading a tiny bit of efficieny for
> >much cleaner code, and I'd for 100% go down Christoph's route here.
> >Then again I'd actually be rather surprised if > page buffers
> >were more efficient - you'd run into shitloads over overhead due to
> >them beeing non-contingous like calling vmap all over the place,
> >reprogramming iommus to at least make them look virtually contingous [1],
> >etc..
> I still think hardware should work reasonably well with 4K pages. The
> SGI io controllers and/or the Linux block layer that doesn't allow more
> than 128 sg entries is clearly suboptimal if the hardware runs twice as
> fast with 2MB submissions.
> >I also don't quite get what your problem with higher order allocations
> >are. order 1 allocations are generally just fine, and in fact
> >thread stacks are >= oder 1 on most architectures. And if the pagecache
> >uses higher order allocations that means we'll finally fix our problems
> >with them, which we have to do anyway. Workloads continue to grow and
> >with them the kernel overhead to manage them, while the pagesize for
> >many architectures is fixed. So we'll have to deal with order 1
> >and order 2 allocations better just for backing kmalloc and co.
> The pagecache is much bigger and often a lot more activity than these
> other things though. Also, the more things you add to higher order
> allocations, the more pressure you have.
> I like PAGE_SIZE pagecache, because it is reliable and really fast, if
> you need to reclaim a page it should be almost O(1).
> >Or think jumboframes for that matter.
> They can actually run into problems if the hardware wants contiguous
> memory.
> I don't know why you think the fragmentation issues are just magically
> fixed. It is hard and inefficient to reclaim larger order blocks (even
> with lumpy reclaim), and Mel's patches aren't perfect. Actually, last
> time I looked, they needed to keep at least 16MB of pages free to be
> reasonably effective (or do we just say that people with less than XMB
> of memory shouldn't be accessing these filesystems anyway?)

It'll work without adjusting the min_free_kbytes at all. The 16MB free had
better results after fragmentation stress tests but this was a few percent
of memory when allocating as huge pages as opposed to it falling apart. The
success rates were still way way higher than the vanilla kernel.

>, and I'm
> not sure if they have been tested for long term stability in the
> presence of a reasonable amount of higher order allocations.

I don't have a sample workload that has reasonable amount of higher order
allocations over longer period of time. When the next -mm comes out, SLUB will
be able to use high-order pages so I'll boot my machine with less memory to
pressure it more. Assuming the kernel boots on my desktop machine, I should
get some idea of what its long-term behaviour looks like.

Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at