Re: [BUGFIX][mm][PATCH] fix migration race in rmap_walk

From: Minchan Kim
Date: Mon Apr 26 2010 - 00:00:57 EST


Hi, Mel.

On Sat, Apr 24, 2010 at 12:58 AM, Mel Gorman <mel@xxxxxxxxx> wrote:
> On Fri, Apr 23, 2010 at 10:59:22AM +0100, Mel Gorman wrote:
>> On Fri, Apr 23, 2010 at 12:01:48PM +0900, KAMEZAWA Hiroyuki wrote:
>> > This patch itself is for -mm, but may need to go to the -stable tree for
>> > memory hotplug. (But we've got no report of hitting this race...)
>> >
>>
>> Only because it's very difficult to hit. Even when running compaction
>> constantly, it can take anywhere between 10 minutes and 2 hours for me
>> to reproduce it.
>>
>> > This one is the simplest, I think, and works well on my test set.
>> > ==
>> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>> >
>> > In rmap.c, when checking the rmap in the vma chain of page->mapping,
>> > anon_vma->lock or mapping->i_mmap_lock is held while entering the following loop.
>> >
>> >     for_each_vma_in_this_rmap_link(list from page->mapping) {
>> >             unsigned long address = vma_address(page, vma);
>> >             if (address == -EFAULT)
>> >                     continue;
>> >             ....
>> >     }
>> >
>> > vma_address() checks the vma's [start, end, pgoff] against page->index.
>> >
>> > But a vma's [start, end, pgoff] is updated without locks, so vma_address()
>> > can hit a race and may return a wrong result.
>> >
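For reference, the function doing these unlocked reads is vma_address() in
mm/rmap.c. Sketched from the 2.6.34-era source, it looks roughly like this:

static inline unsigned long
vma_address(struct page *page, struct vm_area_struct *vma)
{
        pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
        unsigned long address;

        /* racy: vm_start and vm_pgoff may be updated by vma_adjust() concurrently */
        address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
        if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
                /* page does not fall inside the vma's current span */
                return -EFAULT;
        }
        return address;
}
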
>> > This behavior is not a problem for the usual routines such as
>> > try_to_unmap(). But for page migration, rmap_walk() has to find all the
>> > migration_ptes with which the migration code overwrote valid ptes. This
>> > race is critical and causes a BUG in which a migration_pte is sometimes
>> > not removed.
>> >
>> > Apr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
>> > Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
>> > Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
>> > Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
>> > Apr 21 17:27:47 localhost kernel: CPU 3
>> > Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
>> > Apr 21 17:27:47 localhost kernel:
>> > Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G        W   2.6.34-rc4-mm1+ #4 D2519/PRIMERGY
>> > Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>]  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
>> > Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08  EFLAGS: 00010246
>> > Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
>> > Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
>> > Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
>> > Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
>> > Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
>> > Apr 21 17:27:47 localhost kernel: FS:  00007fe6abc90700(0000) GS:ffff880005a00000(0000) knlGS:0000000000000000
>> > Apr 21 17:27:47 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > Apr 21 17:27:47 localhost kernel: CR2: 00007fe6a37279a0 CR3: 000000008d942000 CR4: 00000000000006e0
>> > Apr 21 17:27:47 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > Apr 21 17:27:47 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> > Apr 21 17:27:47 localhost kernel: Process cc1 (pid: 27892, threadinfo ffff88008d9ee000, task ffff8800b23ec820)
>> > Apr 21 17:27:47 localhost kernel: Stack:
>> > Apr 21 17:27:47 localhost kernel: ffffea000101aee8 ffff880621a4aae8 ffff88008d9efe38 00007fe6a37279a0
>> > Apr 21 17:27:47 localhost kernel: <0> ffff8805d9706d90 ffff880621a4aa00 ffff88008d9efef8 ffffffff81126d05
>> > Apr 21 17:27:47 localhost kernel: <0> ffff88008d9efec8 0000000000000246 0000000000000000 ffffffff81586533
>> > Apr 21 17:27:47 localhost kernel: Call Trace:
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff81126d05>] handle_mm_fault+0x995/0x9b0
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff81586533>] ? do_page_fault+0x103/0x330
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff8104bf40>] ? finish_task_switch+0x0/0xf0
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff8158659e>] do_page_fault+0x16e/0x330
>> > Apr 21 17:27:47 localhost kernel: [<ffffffff81582f35>] page_fault+0x25/0x30
>> > Apr 21 17:27:47 localhost kernel: Code: 53 08 85 c9 0f 84 32 ff ff ff 8d 41 01 89 4d d8 89 45 d4 8b 75 d4 8b 45 d8 f0 0f b1 32 89 45 dc 8b 45 dc 39 c8 74 aa 89 c1 eb d7 <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
>> > Apr 21 17:27:47 localhost kernel: RIP  [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
>> > Apr 21 17:27:47 localhost kernel: RSP <ffff88008d9efe08>
>> > Apr 21 17:27:47 localhost kernel: ---[ end trace 4860ab585c1fcddb ]---
>> >
>> > This patch adds vma_address_safe(), and updates [start, end, pgoff]
>> > under a seq counter.
>> >
>>
>> I had considered this idea as well, as it is vaguely similar to how zones
>> get resized with a seqlock. I was hoping that the existing locking on
>> anon_vma would be usable by backing off until uncontended, but maybe not,
>> so let's check out this approach.
>>
>
> A possible combination of the two approaches is as follows. It uses the
> anon_vma lock mostly except where the anon_vma differs between the page
> and the VMAs being walked in which case it uses the seq counter. I've
> had it running a few hours now without problems but I'll leave it
> running at least 24 hours.
>
> ==== CUT HERE ====
>  mm,migration: Prevent rmap_walk_[anon|ksm] from seeing the wrong VMA information by protecting against vma_adjust with a combination of locks and a seq counter
>
> vma_adjust() is updating anon VMA information without any locks taken.
> In contrast, file-backed mappings use the i_mmap_lock. This lack of
> locking can result in races with page migration. During rmap_walk(),
> vma_address() can return -EFAULT for an address that will soon be valid.
> This leaves a dangling migration PTE behind which can later cause a
> BUG_ON to trigger when the page is faulted in.
>
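(For context: the BUG_ON being tripped at swapops.h:105 is the page-lock
assertion in migration_entry_to_page(). From that era's
include/linux/swapops.h, roughly:

static inline struct page *migration_entry_to_page(swp_entry_t entry)
{
        struct page *p = pfn_to_page(swp_offset(entry));
        /*
         * Any use of migration entries may only occur while the
         * corresponding page is locked
         */
        BUG_ON(!PageLocked(p));
        return p;
}

A dangling migration PTE means the faulting task reaches this after migration
has finished and unlocked the page, so the assertion fires.)
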
> With the recent anon_vma changes, there is no single anon_vma->lock that
> can be taken that is safe for rmap_walk() to guard against changes by
> vma_adjust(). Instead, a lock can be taken on one VMA while changes
> happen to another.
>
> What this patch does is protect against updates with a combination of
> locks and seq counters. First, the vma->anon_vma lock is taken by
> vma_adjust() and the sequence counter starts. The lock is released and
> the sequence ended when the VMA updates are complete.
>
> The lock serialises rmap_walk_anon when the page and VMA share the same
> anon_vma. Where the anon_vmas do not match, the seq counter is checked.
> If a change is noticed, rmap_walk_anon drops its locks and starts again
> from scratch as the VMA list may have changed. The dangling migration
> PTE bug was not triggered after several hours of stress testing with
> this patch applied.
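
For readers unfamiliar with seqcounts, the reader side is the standard retry
idiom from <linux/seqlock.h>, sketched below; the rmap_walk_anon variant in
this patch additionally drops and retakes anon_vma->lock before retrying:

        unsigned seq;

        do {
                seq = read_seqcount_begin(&mm->span_seqcounter);
                /* read vma->vm_start, vm_end and vm_pgoff here */
        } while (read_seqcount_retry(&mm->span_seqcounter, seq));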
>
> [kamezawa.hiroyu@xxxxxxxxxxxxxx: Use of a seq counter]
> Signed-off-by: Mel Gorman <mel@xxxxxxxxx>
> ---
>  include/linux/mm_types.h |   13 +++++++++++++
>  mm/ksm.c                 |   17 +++++++++++++++--
>  mm/mmap.c                |   30 ++++++++++++++++++++++++++++++
>  mm/rmap.c                |   25 ++++++++++++++++++++++++-
>  4 files changed, 82 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index b8bb9a6..fcd5db2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -12,6 +12,7 @@
>  #include <linux/completion.h>
>  #include <linux/cpumask.h>
>  #include <linux/page-debug-flags.h>
> +#include <linux/seqlock.h>
>  #include <asm/page.h>
>  #include <asm/mmu.h>
>
> @@ -240,6 +241,18 @@ struct mm_struct {
>         struct rw_semaphore mmap_sem;
>         spinlock_t page_table_lock;             /* Protects page tables and some counters */
>
> +#ifdef CONFIG_MIGRATION
> +       /*
> +        * During migration, rmap_walk walks all the VMAs mapping a particular
> +        * page to remove the migration ptes. It does this without mmap_sem
> +        * held, and the semaphore is unnecessarily heavy to take in this case.
> +        * File-backed VMAs are protected by the i_mmap_lock and anon VMAs are
> +        * protected by this seq counter. If the seq counter changes while
> +        * the migration PTE is being removed, the operation restarts.
> +        */
> +       seqcount_t span_seqcounter;
> +#endif
> +
>         struct list_head mmlist;                /* List of maybe swapped mm's.  These are globally strung
>                                                  * together off init_mm.mmlist, and are protected
>                                                  * by mmlist_lock
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 3666d43..613c762 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1671,11 +1671,24 @@ again:
>                 struct anon_vma_chain *vmac;
>                 struct vm_area_struct *vma;
>
> +retry:
>                 spin_lock(&anon_vma->lock);
>                 list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
> +                       unsigned long update_race;
> +                       bool outside;
>                         vma = vmac->vma;
> -                       if (rmap_item->address < vma->vm_start ||
> -                           rmap_item->address >= vma->vm_end)
> +
> +                       /* See comment in rmap_walk_anon about reading anon VMA info */
> +                       update_race = read_seqcount_begin(&vma->vm_mm->span_seqcounter);
> +                       outside = rmap_item->address < vma->vm_start ||
> +                                       rmap_item->address >= vma->vm_end;
> +                       if (anon_vma != vma->anon_vma &&
> +                                       read_seqcount_retry(&vma->vm_mm->span_seqcounter, update_race)) {
> +                               spin_unlock(&anon_vma->lock);
> +                               goto retry;
> +                       }
> +
> +                       if (outside)
>                                 continue;
>                         /*
>                          * Initially we examine only the vma which covers this
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f90ea92..1508c43 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -491,6 +491,26 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
>                 mm->mmap_cache = prev;
>  }
>
> +#ifdef CONFIG_MIGRATION
> +static void vma_span_seqbegin(struct vm_area_struct *vma)
> +{
> +       write_seqcount_begin(&vma->vm_mm->span_seqcounter);
> +}
> +
> +static void vma_span_seqend(struct vm_area_struct *vma)
> +{
> +       write_seqcount_end(&vma->vm_mm->span_seqcounter);
> +}
> +#else
> +static inline void vma_span_seqbegin(struct vm_area_struct *vma)
> +{
> +}
> +
> +static inline void vma_span_seqend(struct vm_area_struct *vma)
> +{
> +}
> +#endif /* CONFIG_MIGRATION */
> +
>  /*
>   * We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that
>   * is already present in an i_mmap tree without adjusting the tree.
> @@ -578,6 +598,11 @@ again:                 remove_next = 1 + (end > next->vm_end);
>                 }
>         }
>
> +       if (vma->anon_vma) {
> +               spin_lock(&vma->anon_vma->lock);
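
The quoted hunk is cut off here, but judging from the helpers above, the
writer side presumably brackets the vm_start/vm_end/vm_pgoff updates roughly
like this (my sketch of the implied pairing, not the literal rest of the
patch):

        if (vma->anon_vma) {
                spin_lock(&vma->anon_vma->lock);
                vma_span_seqbegin(vma);
        }

        /* ... vm_start, vm_end and vm_pgoff are updated here ... */

        if (vma->anon_vma) {
                vma_span_seqend(vma);
                spin_unlock(&vma->anon_vma->lock);
        }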


Actually, I can't understand why we need to hold vma->anon_vma->lock here.
I think the seqcounter is enough.

I looked at your scenario.

More exactly, your scenario about unmap is as follows.

1. VMA A-Lower - VMA A-Upper (includes the hole)
2. VMA A-Lower - VMA hole - VMA A-Upper (except the hole)
3. VMA A-Lower - the hole is removed at last - VMA A-Upper

I mean VMA A-Upper is already linked into the vma list through
__insert_vm_struct atomically (by the seqcounter).
So rmap can find the proper entry, I think.

What am I missing? :)

--
Kind regards,
Minchan Kim