Re: x86: mm: Fix vmalloc_fault oops during lazy MMU updates.

From: Konrad Rzeszutek Wilk
Date: Fri Feb 22 2013 - 20:06:35 EST

Next message: Konrad Rzeszutek Wilk: "Re: Regression introduced by 805d410fb0dbd65e1a57a810858fa2491e75822d (ACPI: Separate adding ACPI device objects from probing ACPI drivers) in v3.9-rc0"
Previous message: Rafael J. Wysocki: "Re: [GIT PATCH] USB patches for 3.9-rc1"
In reply to: Samu Kallio: "Re: x86: mm: Fix vmalloc_fault oops during lazy MMU updates."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Feb 21, 2013 at 05:56:35PM +0200, Samu Kallio wrote:
> On Thu, Feb 21, 2013 at 2:33 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@xxxxxxxxxx> wrote:
> > On Sun, Feb 17, 2013 at 02:35:52AM -0000, Samu Kallio wrote:
> >> In paravirtualized x86_64 kernels, vmalloc_fault may cause an oops
> >> when lazy MMU updates are enabled, because set_pgd effects are being
> >> deferred.
> >>
> >> One instance of this problem is during process mm cleanup with memory
> >> cgroups enabled. The chain of events is as follows:
> >>
> >> - zap_pte_range enables lazy MMU updates
> >> - zap_pte_range eventually calls mem_cgroup_charge_statistics,
> >> which accesses the vmalloc'd mem_cgroup per-cpu stat area
> >> - vmalloc_fault is triggered which tries to sync the corresponding
> >> PGD entry with set_pgd, but the update is deferred
> >> - vmalloc_fault oopses due to a mismatch in the PUD entries
> >>
> >> Calling arch_flush_lazy_mmu_mode immediately after set_pgd makes the
> >> changes visible to the consistency checks.
> >
> > How do you reproduce this? Is there a BUG() or WARN() trace that
> > is triggered when this happens?
>
> In my case I've seen this triggered on an Amazon EC2 (Xen PV) instance
> under heavy load spawning many LXC containers. The best I can say at
> this point is that the frequency of this bug seems to be linked to how
> busy the machine is.
>
> The earliest report of this problem was from 3.3:
> http://comments.gmane.org/gmane.linux.kernel.cgroups/5540
> I can personally confirm the issue since 3.5.
>
> Here's a sample bug report from a 3.7 kernel (vanilla with Xen XSAVE patch
> for EC2 compatibility). The latest kernel version on which I have tested
> and seen this problem occur is 3.7.9.
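
For readers following the thread, the deferred-update problem in the patch
description can be illustrated with a small user-space toy model. This is only
a sketch under stated assumptions: it is not the kernel code and not the patch
itself. Every name in it (toy_set_entry, toy_flush_lazy_mode, toy_sync_entry,
and so on) is hypothetical and merely stands in for set_pgd,
arch_flush_lazy_mmu_mode and the check in vmalloc_fault. It shows a write that
is queued while lazy mode is active, a read-back check that fails on the stale
entry, and the same check passing once the queue is flushed right after the
write.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: all names here are hypothetical, none are kernel APIs. */
#define TOY_QUEUE_MAX 8

typedef unsigned long entry_t;

struct toy_pending {
    entry_t *slot;  /* table entry the deferred write targets */
    entry_t val;    /* value to store when the queue is flushed */
};

static struct toy_pending toy_queue[TOY_QUEUE_MAX];
static size_t toy_queued;
static int toy_lazy;

/* Apply every deferred write in order, then empty the queue. */
static void toy_flush_lazy_mode(void)
{
    for (size_t i = 0; i < toy_queued; i++)
        *toy_queue[i].slot = toy_queue[i].val;
    toy_queued = 0;
}

static void toy_enter_lazy_mode(void)
{
    toy_lazy = 1;
}

static void toy_leave_lazy_mode(void)
{
    toy_flush_lazy_mode();
    toy_lazy = 0;
}

/* Stand-in for set_pgd: deferred while lazy mode is on, immediate otherwise. */
static void toy_set_entry(entry_t *slot, entry_t val)
{
    if (!toy_lazy) {
        *slot = val;
        return;
    }
    if (toy_queued == TOY_QUEUE_MAX)
        toy_flush_lazy_mode();  /* keep ordering when the queue is full */
    toy_queue[toy_queued].slot = slot;
    toy_queue[toy_queued].val = val;
    toy_queued++;
}

/*
 * Stand-in for the sync step in vmalloc_fault: copy the reference entry
 * into an empty slot, then read the slot back as a consistency check.
 * Returns 0 if the check passes, -1 where the real code would hit BUG().
 */
static int toy_sync_entry(entry_t *pgd, const entry_t *pgd_ref,
                          int flush_after_set)
{
    if (*pgd == 0) {
        toy_set_entry(pgd, *pgd_ref);
        if (flush_after_set)
            toy_flush_lazy_mode();  /* the idea behind the proposed fix */
    }
    return (*pgd == *pgd_ref) ? 0 : -1;
}
```

With lazy mode entered and flush_after_set set to 0, toy_sync_entry() returns
-1 because the read-back still sees the empty slot. With flush_after_set set
to 1 it returns 0, which mirrors why flushing immediately after set_pgd makes
the update visible to the consistency checks.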

Ingo,

I am OK with this patch. Are you OK taking this in or should I take
it (and add the nice RIP below)?

It should also have CC: stable@xxxxxxxxxxxxxxx on it.

FYI, there is also a Red Hat bug for this: https://bugzilla.redhat.com/show_bug.cgi?id=914737

>
> [11852214.733630] ------------[ cut here ]------------
> [11852214.733642] kernel BUG at arch/x86/mm/fault.c:397!
> [11852214.733648] invalid opcode: 0000 [#1] SMP
> [11852214.733654] Modules linked in: veth xt_nat xt_comment fuse btrfs
> libcrc32c zlib_deflate ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat
> xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
> bridge stp llc iptable_filter ip_tables x_tables ghash_clmulni_intel
> aesni_intel aes_x86_64 ablk_helper cryptd xts lrw gf128mul microcode
> ext4 crc16 jbd2 mbcache
> [11852214.733695] CPU 1
> [11852214.733700] Pid: 1617, comm: qmgr Not tainted 3.7.0-1-ec2 #1
> [11852214.733705] RIP: e030:[<ffffffff8143018d>] [<ffffffff8143018d>]
> vmalloc_fault+0x14b/0x249
> [11852214.733725] RSP: e02b:ffff88083e57d7f8 EFLAGS: 00010046
> [11852214.733730] RAX: 0000000854046000 RBX: ffffe8ffffc80d70 RCX:
> ffff880000000000
> [11852214.733736] RDX: 00003ffffffff000 RSI: ffff880854046ff8 RDI:
> 0000000000000000
> [11852214.733744] RBP: ffff88083e57d818 R08: 0000000000000000 R09:
> ffff880000000ff8
> [11852214.733750] R10: 0000000000007ff0 R11: 0000000000000001 R12:
> ffff880854686e88
> [11852214.733758] R13: ffffffff8180ce88 R14: ffff88083e57d948 R15:
> 0000000000000000
> [11852214.733768] FS: 00007ff3bf0f8740(0000)
> GS:ffff88088b480000(0000) knlGS:0000000000000000
> [11852214.733777] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [11852214.733782] CR2: ffffe8ffffc80d70 CR3: 0000000854686000 CR4:
> 0000000000002660
> [11852214.733790] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [11852214.733796] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [11852214.733803] Process qmgr (pid: 1617, threadinfo
> ffff88083e57c000, task ffff88084474b3e0)
> [11852214.733810] Stack:
> [11852214.733814] 0000000000000029 0000000000000002 ffffe8ffffc80d70
> ffff88083e57d948
> [11852214.733828] ffff88083e57d928 ffffffff8103e0c7 0000000000000000
> ffff88083e57d8d0
> [11852214.733840] ffff88084474b3e0 0000000000000060 0000000000000000
> 0000000000006cf6
> [11852214.733852] Call Trace:
> [11852214.733861] [<ffffffff8103e0c7>] __do_page_fault+0x2c7/0x4a0
> [11852214.733871] [<ffffffff81004ac2>] ? xen_mc_flush+0xb2/0x1b0
> [11852214.733880] [<ffffffff810032ce>] ? xen_end_context_switch+0x1e/0x30
> [11852214.733888] [<ffffffff810043cb>] ? xen_write_msr_safe+0x9b/0xc0
> [11852214.733900] [<ffffffff810125b3>] ? __switch_to+0x163/0x4a0
> [11852214.733907] [<ffffffff8103e2de>] do_page_fault+0xe/0x10
> [11852214.733919] [<ffffffff81437f98>] page_fault+0x28/0x30
> [11852214.733930] [<ffffffff8115e873>] ?
> mem_cgroup_charge_statistics.isra.12+0x13/0x50
> [11852214.733940] [<ffffffff8116012e>] __mem_cgroup_uncharge_common+0xce/0x2d0
> [11852214.733948] [<ffffffff81007fee>] ? xen_pte_val+0xe/0x10
> [11852214.733958] [<ffffffff8116391a>] mem_cgroup_uncharge_page+0x2a/0x30
> [11852214.733966] [<ffffffff81139e78>] page_remove_rmap+0xf8/0x150
> [11852214.733976] [<ffffffff8112d78a>] ? vm_normal_page+0x1a/0x80
> [11852214.733984] [<ffffffff8112e5b3>] unmap_single_vma+0x573/0x860
> [11852214.733994] [<ffffffff81114520>] ? release_pages+0x1f0/0x230
> [11852214.734004] [<ffffffff810054aa>] ? __xen_pgd_walk+0x16a/0x260
> [11852214.734018] [<ffffffff8112f0b2>] unmap_vmas+0x52/0xa0
> [11852214.734026] [<ffffffff81136e08>] exit_mmap+0x98/0x170
> [11852214.734034] [<ffffffff8104b929>] mmput+0x59/0x110
> [11852214.734043] [<ffffffff81053d95>] exit_mm+0x105/0x130
> [11852214.734051] [<ffffffff814376e0>] ? _raw_spin_lock_irq+0x10/0x40
> [11852214.734059] [<ffffffff81053f27>] do_exit+0x167/0x900
> [11852214.734070] [<ffffffff8106093d>] ? __sigqueue_free+0x3d/0x50
> [11852214.734079] [<ffffffff81060b9e>] ? __dequeue_signal+0x10e/0x1f0
> [11852214.734087] [<ffffffff810549ff>] do_group_exit+0x3f/0xb0
> [11852214.734097] [<ffffffff81063431>] get_signal_to_deliver+0x1c1/0x5e0
> [11852214.734107] [<ffffffff8101334f>] do_signal+0x3f/0x960
> [11852214.734114] [<ffffffff811aae61>] ? ep_poll+0x2a1/0x360
> [11852214.734122] [<ffffffff81083420>] ? try_to_wake_up+0x2d0/0x2d0
> [11852214.734129] [<ffffffff81013cd8>] do_notify_resume+0x48/0x60
> [11852214.734138] [<ffffffff81438a5a>] int_signal+0x12/0x17
> [11852214.734143] Code: ff ff 3f 00 00 48 21 d0 4c 8d 0c 30 ff 14 25
> b8 f3 81 81 48 21 d0 48 01 c6 48 83 3e 00 0f 84 fa 00 00 00 49 8b 39
> 48 85 ff 75 02 <0f> 0b ff 14 25 e0 f3 81 81 49 89 c0 48 8b 3e ff 14 25
> e0 f3 81
> [11852214.734212] RIP [<ffffffff8143018d>] vmalloc_fault+0x14b/0x249
> [11852214.734222] RSP <ffff88083e57d7f8>
> [11852214.734231] ---[ end trace 81ac798210f95867 ]---
> [11852214.734237] Fixing recursive fault but reboot is needed!
>
> > Also, please CC me next time.
>
> Will do. I originally CC'd Jeremy since he made some lazy MMU related
> cleanups in arch/x86/mm/fault.c, and I thought he might have a comment
> on this.
