"Christoph" == Christoph Hellwig <hch@xxxxxxxxxxxxx> writes:
Christoph> which has a nopage routine that calls remap_pfn_range from
Christoph> ->nopage for uncached memory that's not part of the mem
Christoph> map. Because ->nopage wants to return a struct page * he's
Christoph> allocating a normal kernel page and actually returns that
Christoph> one - to get the page he wants into the pagetables his does
Christoph> all the pagetable manipulation himself before (See the
Christoph> glory details of pagetable walks and modification inside a
Christoph> driver in the patch above).
Christoph> I don't think these hacks are acceptable for a driver,
Christoph> especially as the problem can easily be solved by calling
Christoph> remap_pfn_range in ->mmap - except SGI also wants node
Let me try and provide some more background then.
Simply doing remap_pfn_range in the mmap call doesn't work for large
Take the example of a 2048 CPU system (512 CPUs per partition/machine
- each machine running it's own OS) running an MPI application
across all 2048 CPUs using cross coherency domain traffic.
A standard application will allocate 56 DDQs per thread (the DDQs are
used for synchronization and allocated through the mspec driver) which
translates to having 126976 uncached cache lines reserved or 992 pages
per worker thread. The controlling thread on each partition will mmap
the entire DDQ space up front and then fork off the workers who will
then go and touch their pages. With the current approach by the driver
this means that if you have two threads per node you will end up with
~32MB of uncached memory allocated per node.
Alternatively doing this at mmap time having 512 worker threads per
partition, the result is ~8GB (992 * 16K * 512) of uncached memory all
allocated by the master thread on each machine.
A typical system configuration is 4GB or 8GB of RAM per node. This
means that by using the remap_pfn_range at mmap time approach and the
kernel's standard overhead you end up completely starving the first
couple of nodes of memory on each partition.
Combine this with the effect of all synchronization traffic hitting
the same node, you effectively end up with 512 CPUs all constantly
hammering the same memory controller to death.
FWIW, an initial implementation of the driver was done by someone
within SGI, prior to me having anything to do with it. It was using
the remap_pfn_range at mmap time approach and it was noticed then that
16 worker threads was pretty much enough to overwhelm a node.
Having the page allocations and drop ins on a first touch basis is
consistent with what is done for cached memory and seems a pretty
reasonable approach to me. Sure it isn't particularly pretty to use
the ->nopage approach, nobody disagrees with you there, but what is