Re: [RFC][PATCH 03/26] mm, mpol: add MPOL_MF_LAZY ...

From: Lee Schermerhorn
Date: Fri Jul 06 2012 - 16:05:46 EST


On Fri, 2012-07-06 at 12:38 -0400, Rik van Riel wrote:
> On 03/23/2012 07:50 AM, Mel Gorman wrote:
> > On Fri, Mar 16, 2012 at 03:40:31PM +0100, Peter Zijlstra wrote:
> >> From: Lee Schermerhorn<Lee.Schermerhorn@xxxxxx>
> >>
> >> This patch adds another mbind() flag to request "lazy migration".
> >> The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
> >> pages are simply unmapped from the calling task's page table ['_MOVE]
> >> or from all referencing page tables [_MOVE_ALL]. Anon pages will first
> >> be added to the swap [or migration?] cache, if necessary. The pages
> >> will be migrated in the fault path on "first touch", if the policy
> >> dictates at that time.
> >>
> >> <SNIP>
> >>
> >> @@ -950,6 +950,98 @@ static int unmap_and_move_huge_page(new_
> >> }
> >>
> >> /*
> >> + * Lazy migration: just unmap pages, moving anon pages to swap cache, if
> >> + * necessary. Migration will occur, if policy dictates, when a task faults
> >> + * an unmapped page back into its page table--i.e., on "first touch" after
> >> + * unmapping. Note that migrate-on-fault only migrates pages whose mapping
> >> + * [e.g., file system] supplies a migratepage op, so we skip pages that
> >> + * wouldn't migrate on fault.
> >> + *
> >> + * Pages are placed back on the lru whether or not they were successfully
> >> + * unmapped. Like migrate_pages().
> >> + *
> >> + * Unline migrate_pages(), this function is only called in the context of
> >> + * a task that is unmapping it's own pages while holding its map semaphore
> >> + * for write.
> >> + */
> >> +int migrate_pages_unmap_only(struct list_head *pagelist)
> >
> > I'm not properly reviewing these patches at the moment but am taking a
> > quick look as I play some catch up on linux-mm.
> >
> > I think it's worth pointing out that this potentially will confuse
> > reclaim. Lets say a process is being migrated to another node and it
> > gets unmapped like this then some heuristics will change.
> >
> > 1. If the page was referenced prior to the unmapping then it should be
> > activated if the page reached the end of the LRU due to the checks
> > in page_check_references(). If the process has been unmapped for
> > migrate-on-fault, the pages will instead be reclaimed.
> >
> > 2. The heuristic that applies pressure to slab pages if pages are mapped
> > is changed. Prior to migrate-on-fault sc->nr_scanned is incremented
> > for mapped pages to increase the number of slab pages scanned to
> > avoid swapping. During migrate-on-fault, this pressure is relieved
> >
> > 3. zone_reclaim_mode in default mode will reclaim pages it would
> > previously have skipped over. It potentially will call shrink_zone more
> > for the local node than falling back to other nodes because it thinks
> > most pages are unmapped. This could lead to some trashing.
> >
> > It may not even be a major problem but it's worth thinking about. If it
> > is a problem, it will be necessary to account for migrate-on-fault pages
> > similar to mapped pages during reclaim.
>
> I can see other serious issues with this approach:
>
> 4. Putting a lot of pages in the swap cache ends up allocating
> swap space. This means this NUMA migration scheme will only
> work on systems that have a substantial amount of memory
> represented by swap space. This is highly unlikely on systems
> with memory in the TB range. On smaller systems, it could drive
> the system out of memory (to the OOM killer), by "filling up"
> the overflow swap with migration pages instead.
> 5. In the long run, we want the ability to migrate transparent
> huge pages as one unit. The reason is simple, the performance
> penalty for running on the wrong NUMA node (10-20%) is on the
> same order of magnitude as the performance penalty for running
> with 4kB pages instead of 2MB pages (5-15%).
>
> Breaking up large pages into small ones, and having khugepaged
> reconstitute them on a random NUMA node later on, will negate
> the performance benefits of both NUMA placement and THP.
>
> In short, while this approach made sense when Lee first proposed
> it several years ago (with smaller memory systems, and before Linux
> had transparent huge pages), I do not believe it is an acceptable
> approach to NUMA migration any more.
>
> We really want something like PROT_NONE or PTE_NUMA page table
> (and page directory) entries, so we can avoid filling up swap
> space with migration pages and have the possibility of migrating
> transparent huge pages in one piece at some point.
>
> In other words, NAK to this patch
>

When I originally posted the "migrate on fault" series, I posted a
separate series with a "migration cache" to avoid the use of swap space
for lazy migration: http://markmail.org/message/xgvvrnn2nk4nsn2e.

The migration cache was originally implemented by Marcello Tosatti for
the old memory hotplug project:
http://marc.info/?l=linux-mm&m=109779128211239&w=4.

The idea is that you don't need swap space for lazy migration, just an
"address_space" where you can park an anon VMA's pte's while they're
"unmapped" to cause migration faults. Based on a suggestion from
Christoph Lameter, I had tried to hide the migration cache behind the
swap cache interface to minimize changes mainly in do_swap_page and
vmscan/reclaim. It seemed to work, but the difference in reference
count semantics for the mig cache -- entry removed when last pte
migrated/mapped -- makes coordination with exit teardown, uh, tricky.

Regards,
Lee



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/