Re: x86: mm: Fix vmalloc_fault oops during lazy MMU updates.

From: Samu Kallio
Date: Thu Feb 21 2013 - 10:56:44 EST


On Thu, Feb 21, 2013 at 2:33 PM, Konrad Rzeszutek Wilk
<konrad.wilk@xxxxxxxxxx> wrote:
> On Sun, Feb 17, 2013 at 02:35:52AM -0000, Samu Kallio wrote:
>> In paravirtualized x86_64 kernels, vmalloc_fault may cause an oops
>> when lazy MMU updates are enabled, because set_pgd effects are being
>> deferred.
>>
>> One instance of this problem is during process mm cleanup with memory
>> cgroups enabled. The chain of events is as follows:
>>
>> - zap_pte_range enables lazy MMU updates
>> - zap_pte_range eventually calls mem_cgroup_charge_statistics,
>> which accesses the vmalloc'd mem_cgroup per-cpu stat area
>> - vmalloc_fault is triggered which tries to sync the corresponding
>> PGD entry with set_pgd, but the update is deferred
>> - vmalloc_fault oopses due to a mismatch in the PUD entries
>>
>> Calling arch_flush_lazy_mmu_mode immediately after set_pgd makes the
>> changes visible to the consistency checks.
>
> How do you reproduce this? Is there a BUG() or WARN() trace that
> is triggered when this happens?

In my case I've seen this triggered on an Amazon EC2 (Xen PV) instance
under heavy load while spawning many LXC containers. I don't have a
reliable reproducer; the best I can say at this point is that the
frequency of this bug seems to be linked to how busy the machine is.

The earliest report of this problem was from 3.3:
http://comments.gmane.org/gmane.linux.kernel.cgroups/5540
I can personally confirm the issue since 3.5.

Here's a sample bug report from a 3.7 kernel (vanilla with Xen XSAVE patch
for EC2 compatibility). The latest kernel version on which I have tested
and seen this problem occur is 3.7.9.

[11852214.733630] ------------[ cut here ]------------
[11852214.733642] kernel BUG at arch/x86/mm/fault.c:397!
[11852214.733648] invalid opcode: 0000 [#1] SMP
[11852214.733654] Modules linked in: veth xt_nat xt_comment fuse btrfs libcrc32c zlib_deflate ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bridge stp llc iptable_filter ip_tables x_tables ghash_clmulni_intel aesni_intel aes_x86_64 ablk_helper cryptd xts lrw gf128mul microcode ext4 crc16 jbd2 mbcache
[11852214.733695] CPU 1
[11852214.733700] Pid: 1617, comm: qmgr Not tainted 3.7.0-1-ec2 #1
[11852214.733705] RIP: e030:[<ffffffff8143018d>] [<ffffffff8143018d>] vmalloc_fault+0x14b/0x249
[11852214.733725] RSP: e02b:ffff88083e57d7f8 EFLAGS: 00010046
[11852214.733730] RAX: 0000000854046000 RBX: ffffe8ffffc80d70 RCX: ffff880000000000
[11852214.733736] RDX: 00003ffffffff000 RSI: ffff880854046ff8 RDI: 0000000000000000
[11852214.733744] RBP: ffff88083e57d818 R08: 0000000000000000 R09: ffff880000000ff8
[11852214.733750] R10: 0000000000007ff0 R11: 0000000000000001 R12: ffff880854686e88
[11852214.733758] R13: ffffffff8180ce88 R14: ffff88083e57d948 R15: 0000000000000000
[11852214.733768] FS: 00007ff3bf0f8740(0000) GS:ffff88088b480000(0000) knlGS:0000000000000000
[11852214.733777] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[11852214.733782] CR2: ffffe8ffffc80d70 CR3: 0000000854686000 CR4: 0000000000002660
[11852214.733790] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11852214.733796] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[11852214.733803] Process qmgr (pid: 1617, threadinfo ffff88083e57c000, task ffff88084474b3e0)
[11852214.733810] Stack:
[11852214.733814] 0000000000000029 0000000000000002 ffffe8ffffc80d70 ffff88083e57d948
[11852214.733828] ffff88083e57d928 ffffffff8103e0c7 0000000000000000 ffff88083e57d8d0
[11852214.733840] ffff88084474b3e0 0000000000000060 0000000000000000 0000000000006cf6
[11852214.733852] Call Trace:
[11852214.733861] [<ffffffff8103e0c7>] __do_page_fault+0x2c7/0x4a0
[11852214.733871] [<ffffffff81004ac2>] ? xen_mc_flush+0xb2/0x1b0
[11852214.733880] [<ffffffff810032ce>] ? xen_end_context_switch+0x1e/0x30
[11852214.733888] [<ffffffff810043cb>] ? xen_write_msr_safe+0x9b/0xc0
[11852214.733900] [<ffffffff810125b3>] ? __switch_to+0x163/0x4a0
[11852214.733907] [<ffffffff8103e2de>] do_page_fault+0xe/0x10
[11852214.733919] [<ffffffff81437f98>] page_fault+0x28/0x30
[11852214.733930] [<ffffffff8115e873>] ? mem_cgroup_charge_statistics.isra.12+0x13/0x50
[11852214.733940] [<ffffffff8116012e>] __mem_cgroup_uncharge_common+0xce/0x2d0
[11852214.733948] [<ffffffff81007fee>] ? xen_pte_val+0xe/0x10
[11852214.733958] [<ffffffff8116391a>] mem_cgroup_uncharge_page+0x2a/0x30
[11852214.733966] [<ffffffff81139e78>] page_remove_rmap+0xf8/0x150
[11852214.733976] [<ffffffff8112d78a>] ? vm_normal_page+0x1a/0x80
[11852214.733984] [<ffffffff8112e5b3>] unmap_single_vma+0x573/0x860
[11852214.733994] [<ffffffff81114520>] ? release_pages+0x1f0/0x230
[11852214.734004] [<ffffffff810054aa>] ? __xen_pgd_walk+0x16a/0x260
[11852214.734018] [<ffffffff8112f0b2>] unmap_vmas+0x52/0xa0
[11852214.734026] [<ffffffff81136e08>] exit_mmap+0x98/0x170
[11852214.734034] [<ffffffff8104b929>] mmput+0x59/0x110
[11852214.734043] [<ffffffff81053d95>] exit_mm+0x105/0x130
[11852214.734051] [<ffffffff814376e0>] ? _raw_spin_lock_irq+0x10/0x40
[11852214.734059] [<ffffffff81053f27>] do_exit+0x167/0x900
[11852214.734070] [<ffffffff8106093d>] ? __sigqueue_free+0x3d/0x50
[11852214.734079] [<ffffffff81060b9e>] ? __dequeue_signal+0x10e/0x1f0
[11852214.734087] [<ffffffff810549ff>] do_group_exit+0x3f/0xb0
[11852214.734097] [<ffffffff81063431>] get_signal_to_deliver+0x1c1/0x5e0
[11852214.734107] [<ffffffff8101334f>] do_signal+0x3f/0x960
[11852214.734114] [<ffffffff811aae61>] ? ep_poll+0x2a1/0x360
[11852214.734122] [<ffffffff81083420>] ? try_to_wake_up+0x2d0/0x2d0
[11852214.734129] [<ffffffff81013cd8>] do_notify_resume+0x48/0x60
[11852214.734138] [<ffffffff81438a5a>] int_signal+0x12/0x17
[11852214.734143] Code: ff ff 3f 00 00 48 21 d0 4c 8d 0c 30 ff 14 25 b8 f3 81 81 48 21 d0 48 01 c6 48 83 3e 00 0f 84 fa 00 00 00 49 8b 39 48 85 ff 75 02 <0f> 0b ff 14 25 e0 f3 81 81 49 89 c0 48 8b 3e ff 14 25 e0 f3 81
[11852214.734212] RIP [<ffffffff8143018d>] vmalloc_fault+0x14b/0x249
[11852214.734222] RSP <ffff88083e57d7f8>
[11852214.734231] ---[ end trace 81ac798210f95867 ]---
[11852214.734237] Fixing recursive fault but reboot is needed!
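
To make the failure mode from the patch description easier to follow, here
is a rough user-space model of it. This is only a sketch and not kernel
code: every model_* name is hypothetical, the "page table entry" is a plain
integer, and the deferral queue only stands in for whatever the hypervisor
backend actually batches. It shows a set_pgd-style write being deferred
while lazy mode is on, so the consistency check that follows sees the old
value unless a flush happens first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only. All model_* names are hypothetical, not kernel APIs. */

#define MODEL_MAX_PENDING 8

struct model_pending_write {
	uint64_t *slot;   /* entry that will eventually be written */
	uint64_t value;   /* value to store when the batch is flushed */
};

static struct model_pending_write model_queue[MODEL_MAX_PENDING];
static size_t model_queued;
static bool model_lazy;

/* Stands in for entering lazy MMU mode: later writes are batched. */
static void model_enter_lazy(void)
{
	model_lazy = true;
}

/* Stands in for arch_flush_lazy_mmu_mode(): apply batched writes now. */
static void model_flush_lazy(void)
{
	for (size_t i = 0; i < model_queued; i++)
		*model_queue[i].slot = model_queue[i].value;
	model_queued = 0;
}

/* Stands in for leaving lazy MMU mode: flush, then stop batching. */
static void model_leave_lazy(void)
{
	model_flush_lazy();
	model_lazy = false;
}

/* Stands in for set_pgd(): immediate normally, deferred in lazy mode. */
static void model_set_pgd(uint64_t *slot, uint64_t value)
{
	if (!model_lazy) {
		*slot = value;
		return;
	}
	assert(model_queued < MODEL_MAX_PENDING);
	model_queue[model_queued].slot = slot;
	model_queue[model_queued].value = value;
	model_queued++;
}

/*
 * Models the sync-then-check sequence in the fault handler while the
 * caller is in lazy mode. Returns true if the consistency check passes.
 */
static bool model_vmalloc_fault(bool flush_after_set)
{
	uint64_t pgd_ref = 0x1000;  /* reference entry (arbitrary value) */
	uint64_t pgd = 0;           /* this process's entry, still empty */
	bool consistent;

	model_enter_lazy();             /* as zap_pte_range would have done */
	if (pgd == 0) {
		model_set_pgd(&pgd, pgd_ref);
		if (flush_after_set)
			model_flush_lazy(); /* the proposed fix */
	}
	consistent = (pgd == pgd_ref);  /* the check that oopses on mismatch */
	model_leave_lazy();
	return consistent;
}
```

In this model, model_vmalloc_fault(false) fails the check because the write
is still sitting in the queue, and model_vmalloc_fault(true) passes it.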

> Also pls next time also CC me.

Will do. I originally CC'd Jeremy since he made some lazy MMU related
cleanups in arch/x86/mm/fault.c, and I thought he might have a comment
on this.