Re: v4.6 kernel BUG at mm/rmap.c:1101!

From: Andrea Arcangeli
Date: Mon May 23 2016 - 11:19:03 EST


On Mon, May 23, 2016 at 05:24:59PM +0300, Kirill A. Shutemov wrote:
> On Mon, May 23, 2016 at 05:06:38PM +0300, Mika Westerberg wrote:
> > Hi,
> >
> > After upgrading kernel of my desktop system from v4.6-rc7 to v4.6, I've
> > started seeing following:
> >
> > [176611.093747] page:ffffea0000360000 count:1 mapcount:0 mapping:ffff880034d2e0a1 index:0x1f9b06600 compound_mapcount: 0
> > [176611.093751] flags: 0x3fff8000044079(locked|uptodate|dirty|lru|active|head|swapbacked)
> > [176611.093752] page dumped because: VM_BUG_ON_PAGE(page->index != linear_page_index(vma, address))
> > [176611.093753] page->mem_cgroup:ffff88049e81b800
> > [176611.093765] ------------[ cut here ]------------
> > [176611.093778] kernel BUG at mm/rmap.c:1101!
> > [176611.093787] invalid opcode: 0000 [#1] PREEMPT SMP
> > [176611.093800] Modules linked in: vfat fat usb_storage fuse bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables pl2303 snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hwdep snd_hda_core kvm snd_seq snd_seq_device iTCO_wdt iTCO_vendor_support snd_pcm mxm_wmi irqbypass crct10dif_pclmul joydev crc32_pclmul crc32c_intel mei_me snd_timer ghash_clmulni_intel snd mei lpc_ich i2c_i801 shpchp mfd_core soundcore wmi i915 drm_kms_helper drm e1000e igb serio_raw dca i2c_algo_bit i2c_core ptp pps_core video
> > [176611.093947] CPU: 1 PID: 2851 Comm: BrowserBlocking Tainted: G I 4.6.0 #71
> > [176611.093962] Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD7 TH/Z87X-UD7 TH-CF, BIOS F4 03/18/2014
> > [176611.093981] task: ffff880492193600 ti: ffff8804971e0000 task.ti: ffff8804971e0000
> > [176611.093996] RIP: 0010:[<ffffffff811dbcb3>] [<ffffffff811dbcb3>] page_move_anon_rmap+0x93/0xa0
> > [176611.094018] RSP: 0000:ffff8804971e3d58 EFLAGS: 00010296
> > [176611.094030] RAX: 0000000000000021 RBX: ffffea0000360000 RCX: 0000000000000002
> > [176611.094045] RDX: 0000000080000002 RSI: ffffffff81a2dce2 RDI: 00000000ffffffff
> > [176611.094059] RBP: ffff8804971e3d70 R08: 0000000000016e39 R09: 0000000000000004
> > [176611.094074] R10: 800000000d81f065 R11: ffffffff81f19c4e R12: ffff880034d2e0a0
> > [176611.094088] R13: 00000001f9b06600 R14: ffffea00003607c0 R15: ffff880495b3bc00
> > [176611.094103] FS: 00007f0a91e71700(0000) GS:ffff8804af240000(0000) knlGS:0000000000000000
> > [176611.094119] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [176611.094131] CR2: 00001f9b0661fcc8 CR3: 0000000497097000 CR4: 00000000001406e0
> > [176611.094146] Stack:
> > [176611.094151] ffff880042301398 00001f9b0661fcc8 ffffea0011c746b0 ffff8804971e3df8
> > [176611.094169] ffffffff811ccdd7 000000000000000c ffff880471d1a0f8 ffff880498d2f198
> > [176611.094186] 0000000000000001 ffff8804971e3e50 ffffffff8119b156 0000000000000001
> > [176611.094203] Call Trace:
> > [176611.094213] [<ffffffff811ccdd7>] do_wp_page+0x487/0x710
> > [176611.094225] [<ffffffff8119b156>] ? generic_file_read_iter+0x606/0x6f0
> > [176611.094238] [<ffffffff811cf1e9>] handle_mm_fault+0xf59/0x1d30
> > [176611.094252] [<ffffffff8121eef7>] ? __vfs_read+0xa7/0xd0
> > [176611.094266] [<ffffffff81066298>] __do_page_fault+0x1a8/0x520
> > [176611.094280] [<ffffffff81066632>] do_page_fault+0x22/0x30
> > [176611.094295] [<ffffffff81759508>] page_fault+0x28/0x30
> > [176611.094306] Code: 20 05 a1 81 e8 2f d0 fe ff 0f 0b e8 68 ce fe ff 0f 0b 48 89 d6 e8 ee 32 01 00 eb cd 48 c7 c6 b0 2e a1 81 48 89 df e8 0d d0 fe ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89
> > [176611.094386] RIP [<ffffffff811dbcb3>] page_move_anon_rmap+0x93/0xa0
> > [176611.094400] RSP <ffff8804971e3d58>
> > [176611.099920] ---[ end trace d9cb6b7ad0bd6c55 ]---
> > [176611.099922] note: BrowserBlocking[2851] exited with preempt_count 1
> >
> > I haven't bisected this yet but there seems to be only one commit
> > touching mm in v4.6 so I kind of suspect that it has something to do
> > with the issue. I'll try to revert it next and see if that changes
> > anything.
> >
> > I've seen the issue now few times but I have no easy means to reproduce
> > it. Only thing that seems to be consistent is the fact that the running
> > process is always chrome.
> >
> > The commit in question is:
> >
> > 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages
> > during WP faults").
> >
> > Does this ring any bells? Thanks in advance.
>
> Looks like we forgot to align address if the page is huge.
> I'm not sure if caller or callee should do this.
>
> Below is callee version.
>
> Note that we use address only in CONFIG_DEBUG_VM=y case and the bug is not
> visible on production kernels with the option disabled.
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8a839935b18c..0ea5d9071b32 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1098,6 +1098,8 @@ void page_move_anon_rmap(struct page *page,
>
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	VM_BUG_ON_VMA(!anon_vma, vma);
> +	if (IS_ENABLED(CONFIG_DEBUG_VM) && PageTransHuge(page))
> +		address &= HPAGE_PMD_MASK;
>  	VM_BUG_ON_PAGE(page->index != linear_page_index(vma, address), page);
>
>  	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;

Reviewed-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>

I just sent a patch doing the exact same thing, only embedded in the
VM_BUG_ON_PAGE; either version is fine with me.
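
For anyone following along, here is a rough userspace sketch of why the
check fires for a THP and why masking the address fixes it. It is not
kernel code: struct vma_model, linear_page_index_model(), index_matches()
and replay_oops() are made-up names for illustration, the constants
assume x86-64 with 4 KiB base pages and 2 MiB huge pages, and the
vm_start value is invented (the oops does not show it). Only the fault
address (CR2) and page->index are taken from the report above.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model only: x86-64 values for 4 KiB pages and 2 MiB THP. */
#define PAGE_SHIFT      12
#define HPAGE_PMD_SHIFT 21
#define HPAGE_PMD_MASK  (~((UINT64_C(1) << HPAGE_PMD_SHIFT) - 1))

/* Hypothetical stand-in for the two VMA fields the check depends on. */
struct vma_model {
	uint64_t vm_start;	/* first address covered by the mapping */
	uint64_t vm_pgoff;	/* page offset corresponding to vm_start */
};

/* Same arithmetic as the kernel helper, modeled for small pages only. */
static uint64_t linear_page_index_model(const struct vma_model *vma,
					uint64_t address)
{
	return ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
}

/*
 * Models the debug check: the head page of a THP carries the index of
 * the huge page's first subpage, so the address has to be rounded down
 * to the huge page boundary before the comparison.
 */
static int index_matches(const struct vma_model *vma, uint64_t page_index,
			 uint64_t address, int is_huge)
{
	if (is_huge)
		address &= HPAGE_PMD_MASK;
	return page_index == linear_page_index_model(vma, address);
}

/* Replays CR2 and page->index from the oops; vm_start is an assumption. */
static void replay_oops(void)
{
	const uint64_t assumed_vm_start = UINT64_C(0x1f9b06000000);
	const struct vma_model vma = {
		.vm_start = assumed_vm_start,
		.vm_pgoff = assumed_vm_start >> PAGE_SHIFT,
	};
	const uint64_t fault_address = UINT64_C(0x1f9b0661fcc8);	/* CR2 */
	const uint64_t head_index = UINT64_C(0x1f9b06600);	/* page->index */

	/* Unaligned address yields index 0x1f9b0661f: the BUG condition. */
	assert(!index_matches(&vma, head_index, fault_address, 0));
	/* Aligned to 2 MiB it yields 0x1f9b06600, matching the head page. */
	assert(index_matches(&vma, head_index, fault_address, 1));
}
```

Calling replay_oops() passes both assertions: the raw fault address maps
to the subpage index 0x1f9b0661f, while the address rounded down to the
2 MiB boundary maps to 0x1f9b06600, which is the index reported for the
head page.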