Re: [00/17] Large Blocksize Support V3

From: Nick Piggin
Date: Fri Apr 27 2007 - 07:42:10 EST

Paul Mackerras wrote:
Andrew Morton writes:

If x86 had larger pagesize we wouldn't be seeing any of this. It is a workaround
for present-generation hardware.

Unfortunately, it's not really practical to increase the page size
very much on most systems, because you end up wasting a lot of space
in the page cache. So there is a tension between wanting a small page
size so your page cache uses memory efficiently, and wanting a large
page size so the TLB covers more address space and your programs run
faster (not to mention other benefits such as the kernel having to
manage fewer pages, and I/O being done in bigger chunks).

Thus there is not really any single page size that suits all workloads
and machines. With distros wanting to just have a single kernel per
architecture, and the fact that the page size is a compile-time
constant, we currently end up having to pick one size and just put up
with the fact that it will suck for some users. We currently have
this situation on ppc64 now that POWER5+ and POWER6 machines have
hardware support for 64k pages as well as 4k pages.

So I can see a few different options:

(a) Keep things more or less as they are now and just wear the fact
that we will continue to show lower performance than certain
proprietary OSes, or

(b) Somehow manage to make the page size a variable rather than a
compile-time constant, and pick a suitable page size at boot time
based on how much memory the machine has, or something. I looked at
implementing this at one point and recoiled in horror. :)

(c) Make the page cache able to use small pages for small files and
large pages for large files. AIUI this is basically what Christoph is

Option (a) isn't very palatable to me (nor I expect, Christoph :)
since it basically says that Linux is very much focussed on the
embedded and desktop end of things and isn't really suitable as a
high-performance OS for large SMP systems. I don't want to believe
that. ;)

Option (b) would be a bit of an ugly hack.

Which leaves option (c) - unless you have a further option. So I have
to say I support Christoph on this, at least as far as the general
principle is concerned.

For the TLB issue, higher order pagecache doesn't help. If distros
ship with a 4K page size on powerpc, and use some larger pages in
the pagecache, some people are still going to get angry because
they wanted to use 64K pages... But I agree 64K pages is too big
for most things anyway, and 16 would be better as a default (which
hopefully x86-64 will get one day).

Anyway, for io performance, there are alternatives, dispite what
some people seem to be saying. We can submit larger sglists to the
device for larger ios, which Jens is looking at (which could help
all types of workloads, not just those with sequential large file

After that, I'd find it amusing if HBAs worth thousands of $ have
trouble looking up sglists at the relatively glacial pace that IO
requires, and/or can't spare a few more K for reasonable sglist
sizes, but if that is really the case, then we could use iommus
and/or just attempt to put physically contiguous pages in pagecache,
rather than require it.

SUSE Labs, Novell Inc.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at