Re: [00/17] Large Blocksize Support V3

From: Andrew Morton
Date: Sat Apr 28 2007 - 01:37:04 EST

On Fri, 27 Apr 2007 22:08:17 -0700 (PDT) Christoph Lameter <clameter@xxxxxxx> wrote:

> On Fri, 27 Apr 2007, Andrew Morton wrote:
> > My (repeated) point is that if we populate pagecache with physically-contiguous 4k
> > pages in this manner then bio+block will be able to create much larger SG lists.
> True but the "if" becomes exceedingly rare the longer the system was in
> operation. 64k implies 16 pages in sequence. This is going to be a bit
> difficult to get.

Nonsense. We need higher-order allocations whichever scheme is used.
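
(For reference, the "64k implies 16 pages" arithmetic quoted above can be
sketched as below. This is only an illustration: it assumes 4k base pages
and power-of-two allocation orders, and the function names are made up for
this note, they are not kernel APIs.)

```python
BASE_PAGE_SIZE = 4096  # assumption: 4k base pages, as discussed in this thread


def pages_needed(block_size, page_size=BASE_PAGE_SIZE):
    """Number of base pages needed to cover block_size bytes (rounded up)."""
    return -(-block_size // page_size)


def allocation_order(block_size, page_size=BASE_PAGE_SIZE):
    """Smallest order such that 2**order contiguous pages cover block_size."""
    pages = pages_needed(block_size, page_size)
    return (pages - 1).bit_length()


if __name__ == "__main__":
    # Block sizes in bytes: 4k, 16k, 64k and 1G.
    for size in (4096, 16384, 65536, 1 << 30):
        print(size, pages_needed(size), allocation_order(size))
```

Under those assumptions a 64k block is 16 contiguous pages (order 4), and a
1G page is order 18.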

And lumpy reclaim in the moveable zone should be extremely reliable. It
_should_ be the case that it can only be defeated by excessive use of
mlock. But we've seen no testing to either confirm or refute that.

> Then there is the overhead of handling these pages.
> Which may be not significant given growing processor capabilities in some
> usage cases. In others like a synchronized application running on a large
> number of nodes this is likely to introduce random delays between processor
> to processor communication that will significantly impair performance.

Well, who knows.

> And then there is the long list of features that cannot be accomplished
> with such an approach like mounting a volume with large block size,
> handling CD/DVDs, getting rid of various shim layers etc.

There are disadvantages against which this must be traded off.

And if the volume which is mounted with the large page option also has a
lot of small files on it, we've gone and dramatically deoptimised the
user's machine. It would have been better to make the 4k-page
implementation faster, rather than working around existing inefficiencies.

> I'd also like to have much higher orders of allocations for scientific
> applications that require an extremely large I/O rate. For those we
> could f.e. dedicate memory nodes that will only use a very high page
> order to prevent fragmentation. E.g. 1G pages is certainly something that
> lots of our customers would find beneficial (and they are actually
> already using those types of pages in the form of huge pages but with
> limited capabilities).
> But then we are sadly again trying to find another workaround that
> will not get us there and will not allow the flexibility in the
> VM that would make things much easier for lots of usage scenarios.

Your patch *is* a workaround. It's a workaround for small CPU pagesize.
It's a workaround for suboptimal VFS and filesystem implementations. It's
a workaround for a disk adapter which has suboptimal readahead and
writeback caching implementations.

See? I can spin too.

Fact is, this change has *costs*. And you're completely ignoring them,
trying to spin them away. It ain't working and it never will. I'm seeing
no serious attempt to think about how we can reduce those costs while
retaining most of the benefits.