On Fri, 2012-07-06 at 12:38 -0400, Rik van Riel wrote:
> 4. Putting a lot of pages in the swap cache ends up allocating
>    swap space. This means this NUMA migration scheme will only
>    work on systems that have a substantial amount of memory
>    represented by swap space. This is highly unlikely on systems
>    with memory in the TB range. On smaller systems, it could drive
>    the system out of memory (to the OOM killer) by filling up
>    swap space with migration pages instead of leaving it free for
>    genuine overflow.
> 5. In the long run, we want the ability to migrate transparent
>    huge pages as one unit. The reason is simple: the performance
>    penalty for running on the wrong NUMA node (10-20%) is on the
>    same order of magnitude as the performance penalty for running
>    with 4kB pages instead of 2MB pages (5-15%).
>    Breaking up large pages into small ones, and having khugepaged
>    reconstitute them on a random NUMA node later on, will negate
>    the performance benefits of both NUMA placement and THP.
When I originally posted the "migrate on fault" series, I posted a
separate series with a "migration cache" to avoid the use of swap space
for lazy migration: http://markmail.org/message/xgvvrnn2nk4nsn2e.
The migration cache was originally implemented by Marcelo Tosatti for
the old memory hotplug project:
http://marc.info/?l=linux-mm&m=109779128211239&w=4.
The idea is that you don't need swap space for lazy migration, just an
"address_space" where you can park an anon VMA's PTEs while they are
"unmapped" to cause migration faults. Based on a suggestion from
Christoph Lameter, I had tried to hide the migration cache behind the
swap cache interface, to minimize changes mainly in do_swap_page and
vmscan/reclaim. It seemed to work, but the different reference count
semantics of the migration cache -- an entry is removed when the last
PTE referencing it has been migrated/mapped back -- make coordination
with exit teardown, uh, tricky.
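
To make that reference count difference concrete, here is a tiny
user-space C model of the idea (purely illustrative; none of these
types or names come from the actual patches or from kernel code). A
"pte" is unmapped into a cache keyed by an index, a later fault
"migrates" the page to the faulting node and maps it back, and the
cache entry goes away along with the last mapping -- whereas a swap
cache entry would stick around until its swap slot was freed.

/*
 * Toy user-space model of the migration cache idea (NOT kernel code;
 * all types and names below are made up for illustration).
 */
#include <stdio.h>

struct page {
        int node;               /* NUMA node the data currently lives on */
        int data;
};

struct mig_entry {
        struct page *page;
        int refcount;           /* number of ptes still pointing here */
};

#define CACHE_SIZE 16
static struct mig_entry cache[CACHE_SIZE];

struct pte {
        struct page *page;      /* non-NULL: mapped */
        int mig_index;          /* valid only while page == NULL */
};

/* Unmap a pte and park its page in the cache ("lazy migration" setup). */
static void park(struct pte *pte, int index)
{
        cache[index].page = pte->page;
        cache[index].refcount++;
        pte->page = NULL;
        pte->mig_index = index;
}

/* Fault handler: migrate the parked page to the faulting node, map it. */
static void fault(struct pte *pte, int faulting_node)
{
        struct mig_entry *e = &cache[pte->mig_index];

        e->page->node = faulting_node;  /* "migrate" the page */
        pte->page = e->page;

        /* Migration-cache semantics: drop the entry with the last mapping. */
        if (--e->refcount == 0)
                e->page = NULL;
}

int main(void)
{
        struct page p = { .node = 0, .data = 42 };
        struct pte pte = { .page = &p };

        park(&pte, 0);          /* unmap, park in the migration cache */
        fault(&pte, 1);         /* later fault migrates it to node 1 */

        printf("data %d now on node %d, cache entry %s\n",
               pte.page->data, pte.page->node,
               cache[0].page ? "still present" : "gone");
        return 0;
}

Exit teardown hits the same spot: when the address space is destroyed
without ever faulting, something still has to drop that last
reference, which is where the coordination got awkward.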