Re: [PATCH 3/3] mm,migration: Remove straggling migration PTEs whenpage tables are being moved after the VMA has already moved

From: KAMEZAWA Hiroyuki
Date: Wed Apr 28 2010 - 03:32:48 EST


On Wed, 28 Apr 2010 11:49:44 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:

> On Wed, 28 Apr 2010 04:42:27 +0200
> Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:

> > migrate.c requires rmap to be able to find all ptes mapping a page at
> > all times, otherwise the migration entry can be instantiated, but it
> > can't be removed if the second rmap_walk fails to find the page.
> >
> > So shift_arg_pages must run atomically with respect of rmap_walk, and
> > it's enough to run it under the anon_vma lock to make it atomic.
> >
> > And split_huge_page() will have the same requirements as migrate.c
> > already has.
> >
> > Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>
> Seems good.
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>
> I'll test this and report if I see trouble again.
>
> Unfortunately, I'll have a week of holidays (in Japan) in 4/29-5/05,
> my office is nearly closed. So, please consider no-mail-from-me is
> good information.
>
Here is bad news. When move_page_tables() fails, "some ptes" are moved
but others are not and....there is no rollback routine.

I bet the best way to fix this mess up is
- disable overlap moving of arg pages
- use do_mremap().

But maybe you guys want to fix this directly.
Here is a temporal fix from me. But don't trust me..
==
Subject: fix race between shift_arg_pages and rmap_walk

From: Andrea Arcangeli <aarcange@xxxxxxxxxx>

migrate.c requires rmap to be able to find all ptes mapping a page at
all times, otherwise the migration entry can be instantiated, but it
can't be removed if the second rmap_walk fails to find the page.

So shift_arg_pages must run atomically with respect of rmap_walk, and
it's enough to run it under the anon_vma lock to make it atomic.

And split_huge_page() will have the same requirements as migrate.c
already has.

And, when moving overlapping ptes by move_page_tables(), it's cannot
be roll-backed as mremap does. This patch changes move_page_tables()'s
behavior and if it failes, no ptes are moved.

Changelog:
- modified move_page_tables() to do atomic pte moving because
"some ptes are moved but others are not" is critical for rmap_walk().
- free pgtables at failure rather than give it all to do_exit().
If not, objrmap will be inconsitent until exit() frees all.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
---

---
fs/exec.c | 67 +++++++++++++++++++++++++++++++++++++-----------------------
mm/mremap.c | 28 ++++++++++++++++++++++---
2 files changed, 67 insertions(+), 28 deletions(-)

Index: mel-test/fs/exec.c
===================================================================
--- mel-test.orig/fs/exec.c
+++ mel-test/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
#include <linux/pipe_fs_i.h>
+#include <linux/rmap.h>

#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -503,7 +504,10 @@ static int shift_arg_pages(struct vm_are
unsigned long length = old_end - old_start;
unsigned long new_start = old_start - shift;
unsigned long new_end = old_end - shift;
+ unsigned long moved_length;
struct mmu_gather *tlb;
+ int ret;
+ unsigned long unused_pgd_start, unused_pgd_end, floor, ceiling;

BUG_ON(new_start > new_end);

@@ -517,41 +521,54 @@ static int shift_arg_pages(struct vm_are
/*
* cover the whole range: [new_start, old_end)
*/
- if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
- return -ENOMEM;
+ spin_lock(&vma->anon_vma->lock);
+ vma->vm_start = new_start;

- /*
- * move the page tables downwards, on failure we rely on
- * process cleanup to remove whatever mess we made.
- */
if (length != move_page_tables(vma, old_start,
- vma, new_start, length))
- return -ENOMEM;
-
- lru_add_drain();
- tlb = tlb_gather_mmu(mm, 0);
- if (new_end > old_start) {
+ vma, new_start, length)) {
+ vma->vm_start = old_start;
/*
- * when the old and new regions overlap clear from new_end.
- */
- free_pgd_range(tlb, new_end, old_end, new_end,
- vma->vm_next ? vma->vm_next->vm_start : 0);
+ * we have to free [new_start, new_start+length] pgds
+ * which we've allocated above.
+ */
+ if (new_end > old_start) {
+ unused_pgd_start = new_start;
+ unused_pgd_end = old_start;
+ } else {
+ unused_pgd_start = new_start;
+ unused_pgd_end = new_end;
+ }
+ floor = new_start;
+ ceiling = old_start;
+ ret = -ENOMEM:
} else {
- /*
- * otherwise, clean from old_start; this is done to not touch
- * the address space in [new_end, old_start) some architectures
- * have constraints on va-space that make this illegal (IA64) -
- * for the others its just a little faster.
- */
- free_pgd_range(tlb, old_start, old_end, new_end,
- vma->vm_next ? vma->vm_next->vm_start : 0);
+ if (new_end > old_start) {
+ unused_pgd_start = new_end;
+ unused_pgd_end = old_end;
+ } else {
+ unused_pgd_start = old_start;
+ unused_pgd_end = old_end;
+ }
+ floor = new_end;
+ if (vma->vm_next)
+ ceiling = vma->vm_next->vm_start;
+ else
+ ceiling = 0;
+ ret = 0;
}
+ spin_unlock(&vma->anon_vma->lock);
+
+ lru_add_drain();
+ tlb = tlb_gather_mmu(mm, 0);
+ /* Free unnecessary PGDS */
+ free_pgd_range(tlb, unused_pgd_start, unused_pgd_end, floor, ceiling);
tlb_finish_mmu(tlb, new_end, old_end);

/*
* Shrink the vma to just the new range. Always succeeds.
*/
- vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+ if (!ret)
+ vma->vm_end = new_end;

return 0;
}
Index: mel-test/mm/mremap.c
===================================================================
--- mel-test.orig/mm/mremap.c
+++ mel-test/mm/mremap.c
@@ -134,22 +134,44 @@ unsigned long move_page_tables(struct vm
{
unsigned long extent, next, old_end;
pmd_t *old_pmd, *new_pmd;
+ unsigned long from_addr, to_addr;

old_end = old_addr + len;
flush_cache_range(vma, old_addr, old_end);

- for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
+ /* At first, copy required pmd in the range */
+ for (from_addr = old_addr, to_addr = new_addr;
+ from_addr < old_end; from_addr += extent, to_addr += extent) {
cond_resched();
next = (old_addr + PMD_SIZE) & PMD_MASK;
if (next - 1 > old_end)
next = old_end;
extent = next - old_addr;
- old_pmd = get_old_pmd(vma->vm_mm, old_addr);
+ old_pmd = get_old_pmd(vma->vm_mm, from_addr);
if (!old_pmd)
continue;
- new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
+ new_pmd = alloc_new_pmd(vma->vm_mm, to_addr);
if (!new_pmd)
break;
+ next = (to_addr + PMD_SIZE) & PMD_MASK;
+ if (extent > next - to_addr)
+ extent = next - to_addr;
+ }
+ /* -ENOMEM ? */
+ if (from_addr < old_end) /* the caller must free remaining pmds. */
+ return 0;
+
+ for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
+ cond_resched();
+ next = (old_addr + PMD_SIZE) & PMD_MASK;
+ if (next - 1 > old_end)
+ next = old_end;
+ extent = next - old_addr;
+ old_pmd = get_old_pmd(vma->vm_mm, old_addr);
+ if (!old_pmd)
+ continue;
+ new_pmd = get_new_pmd(vma->vm_mm, new_addr);
+ BUG_ON(!new_pmd);
next = (new_addr + PMD_SIZE) & PMD_MASK;
if (extent > next - new_addr)
extent = next - new_addr;









--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/