Re: [00/17] Large Blocksize Support V3

From: Andy Whitcroft
Date: Thu Apr 26 2007 - 08:38:18 EST

Nick Piggin wrote:
> Christoph Lameter wrote:
>> On Thu, 26 Apr 2007, Nick Piggin wrote:
>>>> mapping through the radix tree. You just need to change the way the
>>>> filesystem looks up pages.
>>> You didn't think any of the criticisms of higher order page cache size
>>> were valid?
>> They are all known points that have been discussed to death.
> I missed the part where you showed that it was a better solution than
> the alternatives.
>>>> What are the exact requirement you are trying to address?
>>> Block size > page cache size.
>> But what do you mean with it? A block is no longer a contiguous
>> section of memory. So you have redefined the term.
> I don't understand what you mean at all. A block has always been a
> contiguous area of disk.

Let's take Nick's definition of a block being a disk-based unit for the
moment. That does not change the key contention here: even hardware
specifically designed to handle 4k pages handles larger contiguous areas
more efficiently. David Chinner gives us figures showing major overall
throughput improvements from (I assume) shorter scatter-gather lists and
better tag utilisation. I am loath to say we can just blame the hardware
vendors for poor design.
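
To make the scatter-gather point concrete, here is a rough sketch of how
the list length scales with the size of the contiguous chunks backing an
I/O. The function name is made up for illustration, and it assumes one
entry per physically contiguous chunk with no merging of adjacent
chunks, which is a simplification of what a real block layer does.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not a kernel API: number of scatter-gather
 * entries needed for an I/O of io_bytes when memory is handed out in
 * physically contiguous chunks of chunk_bytes.
 * Assumes one entry per chunk and no merging of adjacent chunks.
 */
static inline size_t sketch_sg_entries(size_t io_bytes, size_t chunk_bytes)
{
	/* Round up: a partial chunk still costs a full entry. */
	return (io_bytes + chunk_bytes - 1) / chunk_bytes;
}
```

Under those assumptions a 1MB I/O needs 256 entries with 4k chunks but
only 16 with 64k chunks, which is the kind of difference that matters on
a controller with a limited number of sg entries.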

>>> You guys have a couple of problems, firstly you need to have ia64
>>> filesystems accessable to x86_64. And secondly you have these
>>> controllers
>>> without enough sg entries for nice sized IOs.
>> This is not sgi specific sorry.
>>> I sympathise, and higher order pagecache might solve these in a way, but
>>> I don't think it is the right way to go, mainly because of the
>>> fragmentation
>>> issues.
>> And you dont care about Mel's work on that level?
> I actually don't like it too much because it can't provide a robust
> solution. What do you do on systems with small memories, or those that
> eventually do get fragmented?
> Actually, I don't know why people are so excited about being able to
> use higher order allocations (I would rather be more excited about
> never having to use them). But for those few places that really need
> it, I'd rather see them use a virtually mapped kernel with proper
> defragmentation rather than putting hacks all through the core code.

Virtually mapping the kernel was considered pretty seriously around the
time SPARSEMEM was being developed. However, that leads to a
non-constant relation for converting kernel virtual addresses to
physical ones which leads to significant complexity, not to mention
runtime overhead.
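
As a rough sketch of what "non-constant" means here: with a linear
mapping the conversion is a single subtraction, whereas a virtually
mapped kernel needs some per-page lookup. Everything below is
hypothetical and for illustration only (the names, the offset value and
the flat table standing in for a real page table walk); it is not taken
from any architecture's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative values only, not any real architecture's layout. */
#define SKETCH_PAGE_OFFSET 0xffff880000000000ULL
#define SKETCH_PAGE_SHIFT  12
#define SKETCH_PAGE_MASK   ((1ULL << SKETCH_PAGE_SHIFT) - 1)

/* Linear mapping: a constant relation, one subtraction. */
static inline uint64_t sketch_linear_virt_to_phys(uint64_t vaddr)
{
	return vaddr - SKETCH_PAGE_OFFSET;
}

/* Flat table standing in for a page table walk (an assumption). */
struct sketch_vmap {
	const uint64_t *pfn_of_vpage;	/* physical frame per virtual page */
	size_t nr_pages;
};

/* Virtually mapped: a lookup per conversion, which can also fail. */
static inline int sketch_mapped_virt_to_phys(const struct sketch_vmap *map,
					     uint64_t vaddr, uint64_t *paddr)
{
	uint64_t off, vpn;

	if (vaddr < SKETCH_PAGE_OFFSET)
		return -1;
	off = vaddr - SKETCH_PAGE_OFFSET;
	vpn = off >> SKETCH_PAGE_SHIFT;
	if (vpn >= map->nr_pages)
		return -1;
	*paddr = (map->pfn_of_vpage[vpn] << SKETCH_PAGE_SHIFT) |
		 (off & SKETCH_PAGE_MASK);
	return 0;
}
```

The second form costs a memory access and an error path on every
conversion, and the table has to be kept coherent whenever a page is
moved.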

As a solution to the problem of supplying large pages from the allocator
it seems somewhat unsatisfactory. If no other significant changes are
made in support of large allocations, the process of defragmenting
becomes very expensive, requiring a stop_machine style hiatus while the
physical copy and replace occurs for any kernel backed memory.

To put it a different way, even with such a full defragmentation scheme
available, some sort of avoidance scheme would be highly desirable to
avoid using the very expensive defragmentation underlying it.

>>> Increasing PAGE_SIZE, support for block size > page cache size, and
>>> getting
>>> io controllers matched to a 4K page size IMO would be some good ways to
>>> solve these problems. I know they are probably harder...
>> No this has been tried before and does not work. Why should we loose
>> the capability to work with 4k pages just because there is some data
>> that has to be thrown around in quantity? I'd like to have flexibility
>> here.
> Is that a big problem? Really? You use 16K pages on your IPF systems,
> don't you?

To my knowledge, moving to a higher base page size has its advantages in
TLB reach, but brings with it some pretty serious downsides, especially
in caching small files: internal fragmentation in the page cache
significantly affects system performance. So much so that development
is ongoing to see if supporting sub-base-page objects in the buffer
cache could be beneficial.
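
The small-file cost can be put in numbers with a back-of-the-envelope
sketch. The helper name is made up, and it assumes a file's cached data
always occupies whole pages with nothing sharing the tail of the last
one.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not a kernel API: bytes of page cache lost to
 * internal fragmentation when a file of file_bytes is cached in whole
 * pages of page_bytes. Assumes the tail of the last page is unusable.
 */
static inline size_t sketch_cache_waste(size_t file_bytes, size_t page_bytes)
{
	/* Pages needed, rounding up for a partial final page. */
	size_t pages = (file_bytes + page_bytes - 1) / page_bytes;

	return pages * page_bytes - file_bytes;
}
```

Under that assumption a 1k file wastes 3072 bytes with 4k pages and
15360 bytes with 16k pages, so a cache full of small files holds far
fewer of them as the base page size grows.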

>> The fragmentation problem is solvable and we already have a solution
>> in mm. So I do not really see a problem there?
> I don't think that it is solved, and I think the heuristics that are
> there would be put under more stress if they become widely used. And
> it isn't only about whether we can get the page or not, but also about
> the cost. Look up Linus's arguments about page colouring, which are
> similar and I also think are pretty valid.

