[PATCH 2/2] mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks

From: Mel Gorman
Date: Fri Apr 30 2010 - 13:56:24 EST


Page migration requires rmap to be able to find all ptes mapping a page
at all times, otherwise the migration entry can be instantiated, but it
is possible to leave one behind if the second rmap_walk fails to find
the page. If this page is later faulted, migration_entry_to_page() will
call BUG because the page is locked indicating the page was migrated by
the migration PTE not cleaned up. For example

[ 511.201534] kernel BUG at include/linux/swapops.h:105!
[ 511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
[ 511.201534] last sysfs file: /sys/block/sde/size
[ 511.201534] CPU 0
[ 511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
[ 511.888526]
[ 511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[ 511.888526] RIP: 0010:[<ffffffff811094ff>] [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 512.173545] RSP: 0018:ffff880037b979d8 EFLAGS: 00010246
[ 512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
[ 512.329617] RDX: 0000000001a2ba10 RSI: ffffffff818264b8 RDI: 000000000ef45c3e
[ 512.380001] RBP: ffff880037b97a08 R08: ffff880078003f00 R09: ffff880037b979e8
[ 512.380001] R10: ffffffff8114ddaa R11: 0000000000000246 R12: 0000000037304000
[ 512.380001] R13: ffff88007a9ed5c8 R14: f800000000077a2e R15: 000000000ef45c3e
[ 512.380001] FS: 00007f3d346866e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[ 512.380001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 512.380001] CR2: 00007fff6abec9c1 CR3: 0000000037a15000 CR4: 00000000000006f0
[ 512.380001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 513.004775] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 513.068415] Process date (pid: 20431, threadinfo ffff880037b96000, task ffff880078003f00)
[ 513.068415] Stack:
[ 513.068415] ffff880037b97b98 ffff880037b97a18 ffff880037b97be8 0000000000000c00
[ 513.228068] <0> ffff880037304f60 00007fff6abec9c1 ffff880037b97aa8 ffffffff810e951a
[ 513.228068] <0> ffff880037b97a88 0000000000000246 0000000000000000 ffffffff8130c5c2
[ 513.228068] Call Trace:
[ 513.228068] [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
[ 513.228068] [<ffffffff8130c5c2>] ? do_page_fault+0x26a/0x46e
[ 513.228068] [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
[ 513.720755] [<ffffffff8130875d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 513.789278] [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[ 513.851506] [<ffffffff813099b5>] page_fault+0x25/0x30
[ 513.851506] [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[ 513.851506] [<ffffffff811c1e27>] ? strnlen_user+0x3f/0x57
[ 513.851506] [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
[ 513.851506] [<ffffffff8111329b>] search_binary_handler+0x173/0x313
[ 513.851506] [<ffffffff8114c909>] ? load_elf_binary+0x0/0x192b
[ 513.851506] [<ffffffff81114896>] do_execve+0x219/0x30a
[ 513.851506] [<ffffffff8111887f>] ? getname+0x14d/0x1b3
[ 513.851506] [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
[ 514.483501] [<ffffffff8100320a>] stub_execve+0x6a/0xc0
[ 514.548357] Code: 74 05 83 f8 1f 75 68 48 b8 ff ff ff ff ff ff ff 07 48 21 c2 48 b8 00 00 00 00 00 ea ff ff 48 6b d2 38 48 8d 1c 02 f6 03 01 75 04 <0f> 0b eb fe 8b 4b 08 48 8d 73 08 85 c9 74 35 8d 41 01 89 4d e0
[ 514.704292] RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 514.808221] RSP <ffff880037b979d8>
[ 514.906179] ---[ end trace 4f88495edc224d6b ]---

There is a race between shift_arg_pages and migration that triggers this bug.
A temporary stack is setup during exec and later moved. If migration moves
a page in the temporary stack and the VMA is then removed before migration
completes, the migration PTE may not be found leading to a BUG when the
stack is faulted.

Ideally, shift_arg_pages must run atomically with respect of rmap_walk
by holding the anon_vma lock but this is problematic as pages must be
allocated for page tables which cannot happen with a spinlock held. Instead,
this patch skips processes in exec by making an assumption that a VMA with
stack-flags set and a map_count == 1 is in exec that hasn't finalised the
temporary stack yet so don't migrate the pages. Memory hot-remove will try
again, sys_move_pages() wouldn't be operating during exec() time and memory
compaction will just continue to another page without concern.

[kamezawa.hiroyu@xxxxxxxxxxxxxx: Idea for having migration skip temporary stacks]
Signed-off-by: Mel Gorman <mel@xxxxxxxxx>
---
mm/rmap.c | 30 +++++++++++++++++++++++++++++-
1 files changed, 29 insertions(+), 1 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index f95b66d..3bb6c9e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1143,6 +1143,20 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
return ret;
}

+static bool is_vma_temporary_stack(struct vm_area_struct *vma)
+{
+ int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);
+
+ if (!maybe_stack)
+ return false;
+
+ /* If only the stack is mapped, assume exec is in progress */
+ if (vma->vm_mm->map_count == 1)
+ return true;
+
+ return false;
+}
+
/**
* try_to_unmap_anon - unmap or unlock anonymous page using the object-based
* rmap method
@@ -1171,7 +1185,21 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)

list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
+ unsigned long address;
+
+ /*
+ * During exec, a temporary VMA is setup and later moved.
+ * The VMA is moved under the anon_vma lock but not the
+ * page tables leading to a race where migration cannot
+ * find the migration ptes. Rather than increasing the
+ * locking requirements of exec(), migration skips
+ * temporary VMAs until after exec() completes.
+ */
+ if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
+ is_vma_temporary_stack(vma))
+ continue;
+
+ address = vma_address(page, vma);
if (address == -EFAULT)
continue;
ret = try_to_unmap_one(page, vma, address, flags);
--
1.6.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/