Re: [PATCH] allocate page cache pages in round-robin fashion

From: Ray Bryant
Date: Fri Aug 13 2004 - 12:33:11 EST


Hi Martin,

Martin J. Bligh wrote:
<snip>


> Does that actually happen though? Looking at the current code makes me think
> it'll keep some pages free on all nodes at all times, and if kswapd does
> its job, we'll never fall back across nodes. Now ... I think that's broken,
> but I think that's what currently happens - that was what we discussed at
> KS ... I might be misreading it though, I should test it.
>
> Even if that's not true, allocating all your most recent stuff off-node is
> still crap (so either way, I'd agree the current situation is broken), but
> I don't think the solution is to push ALL your accesses (with (n-1)/n
> probability) off-node ... we need to be more careful than that ...


I think you're missing the typical workload situation where we run into this problem. Just to make things a bit more specific, let's assume we are on a 128-node (256P) system with 4 GB per node, and that we have a 100 GB data file that we access periodically during the run; accesses to that file are made in random-access fashion from every node.

The program starts out by reading in the data file, then forks off 256 copies of itself, and allocates 1 GB per CPU (i.e. 2 GB per node) of local storage via MPOL_DEFAULT. All of those pages had better be local, or the computation will be unbalanced and run as slowly as the slowest node.
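To make the arithmetic concrete, here's a little userspace sketch using the numbers from the example above (illustration only, nothing Altix-specific about it):

/*
 * Illustrative arithmetic only -- not kernel code.  The numbers match the
 * hypothetical 128-node, 4 GB/node, 2-CPU/node example above.
 */
#include <stdio.h>

int main(void)
{
	const double nodes          = 128;   /* NUMA nodes                    */
	const double mem_per_node   = 4.0;   /* GB of RAM per node            */
	const double pagecache      = 100.0; /* GB of file data read up front */
	const double local_per_node = 2.0;   /* GB of MPOL_DEFAULT data, 1 GB x 2 CPUs */

	/* Round-robin page cache: every node gives up the same slice. */
	double rr_cache_per_node = pagecache / nodes;
	printf("round-robin: %.2f GB cache + %.1f GB data = %.2f of %.1f GB per node\n",
	       rr_cache_per_node, local_per_node,
	       rr_cache_per_node + local_per_node, mem_per_node);

	/* First-touch with spill: the reading node fills up first, then its
	 * neighbours, so roughly pagecache / mem_per_node nodes end up with
	 * (almost) no room left for their own 2 GB of local data. */
	double nodes_filled = pagecache / mem_per_node;
	printf("first-touch: ~%.0f of %.0f nodes are full of page cache "
	       "before the compute phase even starts\n", nodes_filled, nodes);
	return 0;
}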

As I read the __alloc_pages() code, those 100 GB of page cache pages will be allocated on the node that did the file read; when that node fills up, the allocation spills to adjacent nodes (this is the first loop of __alloc_pages(); kswapd doesn't get invoked until that first loop fails).

(kswapd doesn't get invoked until all of the zones in the zonelist are full. All of memory is in that zonelist unless we have cpusets enabled, so the priority is to spill off-node first and to swap second.)
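Roughly, and purely as an illustration (userspace C with made-up types and helpers, not the real mm/page_alloc.c code), the ordering I'm describing looks like this:

/*
 * Minimal sketch of the fallback order described above.  Everything here
 * is invented for illustration; the real logic is in mm/page_alloc.c and
 * is considerably more involved.
 */
#include <stdbool.h>
#include <stddef.h>

struct zone {
	long free_pages;
	long pages_low;   /* kswapd wakeup watermark */
	long pages_min;   /* hard minimum            */
	int  node;
};

static bool try_zone(struct zone *z, long watermark)
{
	if (z->free_pages > watermark) {
		z->free_pages--;          /* "allocate" one page */
		return true;
	}
	return false;
}

static void wake_kswapd(struct zone *z)
{
	(void)z;                          /* would kick per-node reclaim */
}

/* zonelist[] is ordered local node first, then increasingly remote nodes. */
struct zone *alloc_page_sketch(struct zone **zonelist, size_t n)
{
	size_t i;

	/* First pass: take the first zone above pages_low.  Note that this
	 * walks off-node *before* any reclaim happens. */
	for (i = 0; i < n; i++)
		if (try_zone(zonelist[i], zonelist[i]->pages_low))
			return zonelist[i];

	/* Only when every zone in the list is below pages_low do we wake
	 * kswapd and retry against the lower pages_min watermark. */
	for (i = 0; i < n; i++)
		wake_kswapd(zonelist[i]);
	for (i = 0; i < n; i++)
		if (try_zone(zonelist[i], zonelist[i]->pages_min))
			return zonelist[i];

	return NULL;  /* would fall through to direct reclaim / failure */
}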

Now the application starts allocating its 2 GB per node of local storage, and the nodes where the page cache was allocated end up with non-local pages. (Once again, this happens in the first loop of __alloc_pages().) That placement is unfavorable: the page cache pages occupying local memory are accessed at a tiny fraction of the rate of the data pages that got pushed off-node.

Now I suppose you could argue that the application should fork first and then read in 1/256th of the data on each CPU. The problems with this are, in general, twofold:

(1) It could have been a simple "cp" in a startup script that did the read.
    We can't fix all of those things as well.
(2) The application may be an ISV's program that is not NUMA-aware. We can
    fix most of that by wrapping the program with control scripts (a sketch
    of such a wrapper follows below), but requiring the ISV to build a
    specific NUMA-aware version of the binary for Altix is often not
    feasible. (And because the allocation policy is MPOL_DEFAULT, the
    application doesn't need to have NUMA API calls embedded in it.)
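For what it's worth, here is a minimal sketch of the kind of thing such a wrapper could do, relying on the fact that memory policy and CPU affinity are inherited across exec(). It assumes libnuma's set_mempolicy() declaration is available and hard-codes CPU 0 as a placeholder, so take it as an illustration rather than what our actual control scripts look like:

/*
 * Hypothetical launcher: place an unmodified (non-NUMA-aware) binary
 * without touching its source.  Policy and affinity survive exec(), so
 * the ISV program itself never sees the NUMA API.
 *
 * Build with -lnuma for set_mempolicy(); CPU 0 is just a placeholder.
 */
#define _GNU_SOURCE
#include <numaif.h>      /* set_mempolicy(), MPOL_DEFAULT */
#include <sched.h>       /* sched_setaffinity()           */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	cpu_set_t cpus;

	if (argc < 2) {
		fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
		return 1;
	}

	/* Pin to one CPU so that MPOL_DEFAULT (node-local) allocations all
	 * land on that CPU's node.  A real wrapper would pick the CPU per
	 * process instance rather than hard-coding 0. */
	CPU_ZERO(&cpus);
	CPU_SET(0, &cpus);
	if (sched_setaffinity(0, sizeof(cpus), &cpus) < 0)
		perror("sched_setaffinity");

	/* Explicitly (re)set the default local-allocation policy. */
	if (set_mempolicy(MPOL_DEFAULT, NULL, 0) < 0)
		perror("set_mempolicy");

	execvp(argv[1], argv + 1);
	perror("execvp");
	return 1;
}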


> If we round-robin it ... surely 7/8 of your data (on your 8 node machine)
> will ALWAYS be off-node? I thought we discussed this at KS/OLS - what is
> needed is to punt old pages back off onto another node, rather than
> swapping them out. That way all your pages are going to be local.


Surely you can't be suggesting that I migrate a page cache page to the local node just to read it? If file accesses are random and global, won't you end up bouncing page cache pages hither and yon? Surely it is better just to copy the data across the interconnect to the current node and leave the page where it is?

(YMMV -- all of these tradeoffs are clearly workload dependent.)

That gets complicated pretty quickly, I think. We don't want to constantly shuffle pages between nodes with kswapd, and there's also the problem of deciding when to do it.
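To make that tradeoff explicit, here's a toy cost model (the cost numbers are invented for illustration, not measurements of any real machine): migrating only pays off if the page will be re-read from the same node often enough to amortize the copy and remap overhead, and with random, global access that is rarely the case.

/*
 * Back-of-the-envelope only: compares N remote reads of a page against
 * migrating it first and reading it locally.  The costs below are made
 * up for illustration.
 */
#include <stdio.h>

int main(void)
{
	const double remote_read  = 1.0;   /* cost of touching a remote page   */
	const double local_read   = 0.4;   /* same touch, node-local           */
	const double migrate_cost = 50.0;  /* copy + unmap/remap + TLB flushes */

	for (int n = 1; n <= 256; n *= 4) {
		double stay    = n * remote_read;
		double migrate = migrate_cost + n * local_read;
		printf("%3d future reads from this node: remote=%.1f  migrate=%.1f  -> %s\n",
		       n, stay, migrate, migrate < stay ? "migrate wins" : "leave it");
	}
	/* With random, global access the expected number of re-reads from any
	 * one node is small, so migration rarely wins -- which is the point
	 * made above. */
	return 0;
}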
