Re: [00/17] Large Blocksize Support V3

From: Mel Gorman
Date: Thu Apr 26 2007 - 08:23:44 EST


On Thu, 26 Apr 2007, Eric W. Biederman wrote:

mel@xxxxxxxxx (Mel Gorman) writes:

On (26/04/07 18:55), Nick Piggin didst pronounce:
Mel Gorman wrote:
On (26/04/07 16:50), Nick Piggin didst pronounce:

Fragmentation is the problem. The anti-frag patches don't actually
guarantee anything about fragmentation, and even if they did, then


Grouping pages by mobility does not guarantee anything, but the memory
partition (the kernelcore= boot parameter) does give hard guarantees about
the amount of memory that is "movable". Of course, the partition requires
configuration at boot-time, so it's less than ideal, but the guarantee
is hard.
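
For illustration, suppose a machine wants to cap unmovable kernel
allocations at 512MB and leave everything else movable (the figure is
made up); it would boot with something like:

	kernelcore=512M

Memory above that amount then only takes movable allocations, so it can
be reclaimed or migrated whenever a large contiguous block is needed.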

For the hugepages people, I can understand that's a solution.

I think that's the closest you have ever come to saying fragmentation
avoidance is not a terrible idea :)

But that's
the last thing you want to do on a system with a limited amount of memory,
or a regular Joe's desktop/server.


Regular Joe is not going to be creating a filesystem with large blocks or
be overly concerned with saturating all the disks hanging off his RAID
array. At most, he'll care about faster DVD writing and, considering the
number and duration of those allocations, the system will be able to
handle it. So I don't think Regular Joe will generally care.

The practical question is whether this is a special purpose hack, or
general infrastructure that we expect filesystems to seriously use.

If this is a special purpose hack then the larger pages and
fragmentation are minor issues, but its very existence is a problem.

If this is not a special purpose hack and people do use this a lot,
then fragmentation is a problem.


Like so many other things, I think this would start as something used by
the minority of users with the hardware that requires this sort of
feature and slowly bubble down until it appears on day-to-day machines.
I wouldn't describe it as a hack, but it's fair to say that it'll start
as special purpose. It won't stay that way forever.

Your reply indicates that fragmentation is a concern, as do the
initial posting and the more positive threads. Therefore either
this is a special purpose hack, in which case its existence is
questionable, or it fails as a general solution and its existence
is questionable.


Or it'll start as a special purpose feature and evolve to be a general solution.


Indeed, but then you have to deal with internal fragmentation for
pages larger than the TLB page. I'm not saying it's wrong, but it does
come with its own set of issues.
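
To put a rough number on that internal fragmentation, here is a trivial
userspace sketch; the 64K pagecache page and the 5K file are assumptions
for illustration, not measurements:

	/* How much of a large pagecache page does a small file waste?
	 * All sizes here are illustrative assumptions. */
	#include <stdio.h>

	int main(void)
	{
		unsigned long page = 64 * 1024;	/* assumed pagecache page */
		unsigned long file = 5 * 1024;	/* a small file */
		unsigned long pages = (file + page - 1) / page;
		unsigned long waste = pages * page - file;

		printf("%lu bytes cached, %lu wasted (%lu%%)\n",
		       pages * page, waste, waste * 100 / (pages * page));
		return 0;
	}

That prints roughly 92% waste for this one file; across a tree of mostly
small files, the per-file tail adds up quickly.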

None of the ways to increase the size of pagecache pages is perfect.

I think in the long term, TLB page sizes will probably increase a little
bit... but if a given page size is "good enough" for a CPU, it really
should be good enough for other hardware. I mean, come on, the CPU's TLB
has to have a good hit ratio and handle several lookups per cycle with a
3-cycle latency on 3GHz+ hardware... surely an IO controller's
scatter-gather engine or IOMMU that has to do a few lookups per disk IO
is nowhere near so critical as a CPU's datapath: just add a few more
entries to it, and they've already got hundreds of megs of cache, so that
isn't an issue either.
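
As a back-of-the-envelope comparison, with every figure assumed rather
than measured:

	CPU TLB:  3e9 cycles/sec * ~2 lookups/cycle ~= 6e9 lookups/sec
	IO path:  1e5 IOs/sec * ~4 SG entries/IO    ~= 4e5 lookups/sec

That is roughly four orders of magnitude between the two, which is the
sense in which the IO side is nowhere near as critical.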


I cannot speak with authority on what IO controllers are really capable
of, so maybe someone else will comment on this more.

Well, here is some reality. Using an FPGA (slow by definition)
connected to a HyperTransport interface, I have exceeded a gigabyte a
second without even setting up scatter-gather.

All of the I/O on all of the peripheral busses happens in chunks of
less than one page.

At the rates we are talking about for disks, the issue is really how to
bury the disk latency. For fast arrays, how do you submit enough
requests to bury the latency?

I can't possibly believe any of this is about the cost of processing
a request; rather, the problem is that some devices don't have a large
enough pool of requests to keep them busy if you submit the requests
in 4K page-sized chunks.
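
Little's law makes that concrete: requests in flight ~= request rate *
latency. With figures invented purely for illustration:

	1GB/s in 4K requests at 0.5ms:   ~262,000 req/s * 0.0005s ~= 131 in flight
	1GB/s in 64K requests at 0.5ms:   ~16,400 req/s * 0.0005s ~=   8 in flight

A device with a shallow request queue cannot keep 131 requests in
flight, so it stalls at 4K submissions; with larger requests, a handful
of outstanding commands is enough to bury the latency.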

This all sounds like a device design issue rather than anything more
significant, and it doesn't sound like a long-term trend. Market
pressure should fix the hardware.


Eric


--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab