Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

From: Vlastimil Babka
Date: Tue Jan 14 2014 - 06:05:17 EST


On 01/13/2014 03:03 PM, Vlastimil Babka wrote:
On 01/10/2014 06:48 PM, Motohiro Kosaki wrote:
diff --git a/mm/rmap.c b/mm/rmap.c
index 068522d..b99c742 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long
cursor, unsigned int *mapcount,
BUG_ON(!page || PageAnon(page));

if (locked_vma) {
- mlock_vma_page(page); /* no-op if already
mlocked */
- if (page == check_page)
+ if (page == check_page) {
+ /* we know we have check_page locked */
+ mlock_vma_page(page);
ret = SWAP_MLOCK;
+ } else if (trylock_page(page)) {
+ /*
+ * If we can lock the page, perform mlock.
+ * Otherwise leave the page alone, it will be
+ * eventually encountered again later.
+ */
+ mlock_vma_page(page);
+ unlock_page(page);
+ }
continue; /* don't unmap */
}

I audited all related mm code. However I couldn't find any race that it can close.

Well, I would say the lock here closes the race with page migration, no? (As discussed below)

First off, current munlock code is crazy tricky.

Oops, that's actually a result of my patches to speed up munlock by batching pages (since 3.12).

munlock
down_write(mmap_sem)
do_mlock()
mlock_fixup
munlock_vma_pages_range
__munlock_pagevec
spin_lock_irq(zone->lru_lock)
TestClearPageMlocked(page)
del_page_from_lru_list
spin_unlock_irq(zone->lru_lock)
lock_page
__munlock_isolated_page
unlock_page

up_write(mmap_sem)

Look, TestClearPageMlocked(page) is not protected by lock_page.

Right :( That's my fault, when developing the patch series I didn't see the page
migration race, and it seemed that lock is only needed to protect the rmap operations
in __munlock_isolated_page()

But this is still
safe because Page_mocked mean one or more vma marked VM_LOCKED then we
only need care about turning down final VM_LOCKED. I.e. mmap_sem protect them.

And,

spin_lock_irq(zone->lru_lock)
del_page_from_lru_list
spin_unlock_irq(zone->lru_lock)

This idiom ensures I or someone else isolate the page from lru and isolated pages
will be put back by putback_lru_page in anycase. So, the pages will move the right
lru eventually.

And then, taking page-lock doesn't help to close vs munlock race.

On the other hands, page migration has the following call stack.

some-caller [isolate_lru_page]
unmap_and_move
__unmap_and_move
trylock_page
try_to_unmap
move_to_new_page
migrate_page
migrate_page_copy
unlock_page

The critical part (i.e. migrate_page_copy) is protected by both page isolation and page lock.

However, none of that protects against setting PG_mlocked in try_to_unmap_cluster() -> mlock_vma_page(),
or clearing PG_mlocked in __munlock_pagevec().

The race I have in mind for munlock is:

CPU1 does page migration
some-caller [isolate_lru_page]
unmap_and_move
__unmap_and_move
trylock_page
try_to_unmap
move_to_new_page
migrate_page
migrate_page_copy
mlock_migrate_page - transfers PG_mlocked from old page to new page

CPU2 does munlock:
munlock
down_write(mmap_sem)
do_mlock()
mlock_fixup
munlock_vma_pages_range
__munlock_pagevec
spin_lock_irq(zone->lru_lock)
TestClearPageMlocked(page) - here it finds PG_mlocked already cleared
so it stops, but meanwhile new page already has PG_mlocked set and will
stay inevictable

As Mel pointed out to me, page lock in munlock alone would not help here
anyway, because munlock would just wait for migration to unlock the old page and then still fail to clear a flag that has been transferred by migration to the new page. So if there was a race, it would be older than __munlock_pagevec() removing TestClearPageMlocked(page) from the page_lock protected section.

But then I noticed that migration bails out for pages that have elevated pins, and this is done after making the page unreachable for mlock/munlock via pte migration entries. So this alone should prevent a race with m(un)lock, with three possible scenarios:
- m(un)lock sees the migration entry and waits for migration to
complete (via follow_page_mask), operates on the new page
- m(un)lock gets the page reference before migration entry is set,
finishes before migration checks the page pin count, migration
transfers the new PG_mlocked state to the new page
- m(un)lock doesn't finish that quickly, migration bails out, m(un)lock
changes the old page that is mapped back


Page fault path take lock page and doesn't use page isolation. This is correct.
try_to_unmap_cluster doesn't take page lock, but it ensure the page is isolated. This is correct too.

I don't see where try_to_unmap_cluster() has guaranteed that pages other than check_page are isolated?
(I might just be missing that). So in the race example above, CPU2 could be doing try_to_unmap_cluster()
and set PG_mlocked on old page, with no effect on the new page. Not fatal for the design of lazy mlocking,
but a distorted accounting anyway.

Plus, do_wp_page() has the following comment. But it is wrong. This lock is necessary to protect against
page migration, but not lru manipulation.

if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) {
lock_page(old_page); /* LRU manipulation */
munlock_vma_page(old_page);
unlock_page(old_page);
}


So the protection is actually for rmap operations in try_to_munlock.
The comment could be removed from the caller here and appear just at the PageLocked assertion in munlock_vma_page().

But, you know, this is crazy ugly lock design. I'm ok to change the rule to that PG_mlocked must be protected
page lock. If so, I propose to add PageMlocked() like this

} else if (!PageMlocked() && trylock_page(page)) {
/*
* If we can lock the page, perform mlock.
* Otherwise leave the page alone, it will be
* eventually encountered again later.
*/
mlock_vma_page(page);
unlock_page(page);

This is safe. Even if race is happen and vmscan failed to mark PG_mlocked, that's no problem. Next vmscan may
do the right job and will fix the mistake.

Anyway, I'm also sure this is not recent issue. It lived 5 years. It is no reason to rush.

If we agree that page migration is safe as explained above, then removing the PageLocked assertion from mlock_vma_page is again an option, instead of making sure that all callers lock?

Yeah but there's the new issue with __munlock_pagevec() :( Perhaps a more serious one as it could leave pages inevictable
when they shouldn't be. I'm not sure if the performance benefits of that could be preserved with full page locking.
Another option would be that page migration would somehow deal with the race, or just leave the target pages without
PG_mlocked and let them be dealt with later.
But if we go with the rule that page lock must protect PG_mlocked, then there's also clear_page_mlock() to consider.
And finally, we could then at least replace the atomic test and set with faster non-atomic variants.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/