Re: [BISECTED] Xen HVM guest hangs since 3.12-rc5

From: Konrad Rzeszutek Wilk
Date: Fri Feb 21 2014 - 15:08:26 EST


On Thu, Feb 20, 2014 at 12:44:15PM -0800, Steven Noonan wrote:
> On Wed, Feb 19, 2014 at 1:01 PM, Steven Noonan <steven@xxxxxxxxxxxxxx> wrote:
> > On Wed, Feb 19, 2014 at 9:41 AM, Konrad Rzeszutek Wilk
> > <konrad.wilk@xxxxxxxxxx> wrote:
> >> On Tue, Feb 18, 2014 at 11:16:05PM -0800, Steven Noonan wrote:
> >>> I've been running into problems on an Xen HVM domU. I've got a guest with NUMA
> >>> enabled, 60GB of RAM, and 3 disks attached (including root volume). 2 of the
> >>> disks are in an MD RAID0 in the guest, with an ext4 filesystem on top of that.
> >>> I was running the fio 'iometer-file-access-server.fio' example config against
> >>> that fs. During this workload, it would eventually cause a soft lockup, like
> >>> the below:
> >>
> >> I presume since you mention NUMA and Mel is CC-ed that if you boot without
> >> NUMA enabled (either via the toolstack or via Linux command line) - the issue
> >> is not present?
> >
> > I mentioned NUMA because the bisected commit is sched/numa, and the
> > guest is NUMA-enabled. I hadn't attempted booting with NUMA off. I
> > just tried with numa=off, and the workload has run in a loop for 20
> > minutes so far with no issues (normally the issue would repro in less
> > than 5).
>
> The subject line is actually incorrect -- I did a 'git describe' on
> the result of the bisection when writing the subject line, but the
> '3.12-rc5' tag was just the base on which the code was originally
> developed. As far as what tags actually contain the commit:
>
> $ git tag --contains b795854b1fa70f6aee923ae5df74ff7afeaddcaa
> v3.13
> v3.13-rc1
> v3.13-rc2
> v3.13-rc3
> v3.13-rc4
> v3.13-rc5
> v3.13-rc6
> v3.13-rc7
> v3.13-rc8
> v3.13.1
> v3.13.2
> v3.13.3
> v3.14-rc1
> v3.14-rc2
>
> So it's more accurate to say it was introduced in the v3.13 merge window.
>
> In any case, does anyone have any ideas?

There is nothing in that git commit that gives that 'AHA' feeling.

If you revert that patch on top of the latest Linux kernel does the problem
go away? This is more of a double-check to see if the commit
is really the fault or if it exposed some latent issue.

>
> >>>
> >>> [ 2536.250054] BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u257:0:7]
> >>> [ 2536.250054] Modules linked in: isofs crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd raid0 md_mod acpi_cpufreq psmouse i2c_piix4 intel_agp intel_gtt i2c_core processor serio_raw evdev microcode ext4 crc16 mbcache jbd2 ata_generic pata_acpi ata_piix libata scsi_mod floppy ixgbevf xen_privcmd xen_netfront xen_kbdfront syscopyarea sysfillrect sysimgblt fb_sys_fops xen_blkfront virtio_pci virtio_net virtio_blk virtio_ring virtio ipmi_poweroff ipmi_msghandler button
> >>> [ 2536.250054] CPU: 0 PID: 7 Comm: kworker/u257:0 Tainted: G W 3.12.0-rc4-bisect-00073-g6fe6b2d #26
> >>> [ 2536.250054] Hardware name: Xen HVM domU, BIOS 4.2.amazon 01/14/2014
> >>> [ 2536.250054] Workqueue: writeback bdi_writeback_workfn (flush-202:0)
> >>> [ 2536.250054] task: ffff880766533400 ti: ffff88076652e000 task.ti: ffff88076652e000
> >>> [ 2536.250054] RIP: 0010:[<ffffffff810cc518>] [<ffffffff810cc518>] smp_call_function_many+0x258/0x2b0
> >>> [ 2536.250054] RSP: 0018:ffff88076652f878 EFLAGS: 00000202
> >>> [ 2536.250054] RAX: 000000000000000f RBX: ffff88076652f808 RCX: ffff880ef0ef74a8
> >>> [ 2536.250054] RDX: 000000000000000f RSI: 0000000000000080 RDI: 0000000000000000
> >>> [ 2536.250054] RBP: ffff88076652f8c0 R08: ffff880771046c00 R09: ffff880770c008e0
> >>> [ 2536.250054] R10: 000000000000003e R11: 0000000000000210 R12: ffff88076652f7f0
> >>> [ 2536.250054] R13: ffffffff810b859e R14: ffff88076652f7e0 R15: ffffffff810b50e7
> >>> [ 2536.250054] FS: 0000000000000000(0000) GS:ffff880771600000(0000) knlGS:0000000000000000
> >>> [ 2536.250054] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [ 2536.250054] CR2: 00007f8752bea000 CR3: 0000000001a0d000 CR4: 00000000001406f0
> >>> [ 2536.250054] Stack:
> >>> [ 2536.250054] 0000000181275231 0000000000014d00 ffff88076652f8d0 ffffffff810564e0
> >>> [ 2536.250054] ffff88076530b180 00007f0c8826a000 ffff880ed4d57700 ffff88076530b180
> >>> [ 2536.250054] ffff880ed4cc6350 ffff88076652f8e8 ffffffff81056637 ffff88076530b180
> >>> [ 2536.250054] Call Trace:
> >>> [ 2536.250054] [<ffffffff810564e0>] ? leave_mm+0x70/0x70
> >>> [ 2536.250054] [<ffffffff81056637>] native_flush_tlb_others+0x37/0x40
> >>> [ 2536.250054] [<ffffffff81056988>] flush_tlb_page+0x88/0x90
> >>> [ 2536.250054] [<ffffffff8117bb94>] ptep_clear_flush+0x34/0x40
> >>> [ 2536.250054] [<ffffffff81175b1e>] page_mkclean+0x12e/0x1d0
> >>> [ 2536.250054] [<ffffffff8114aeeb>] clear_page_dirty_for_io+0x3b/0xe0
> >>> [ 2536.250054] [<ffffffffa016fd52>] mpage_submit_page+0x52/0x80 [ext4]
> >>> [ 2536.250054] [<ffffffffa016fe89>] mpage_process_page_bufs+0x109/0x140 [ext4]
> >>> [ 2536.250054] [<ffffffffa01700d7>] mpage_prepare_extent_to_map+0x217/0x2d0 [ext4]
> >>> [ 2536.250054] [<ffffffffa0174929>] ext4_writepages+0x469/0xca0 [ext4]
> >>> [ 2536.250054] [<ffffffff8114cd3e>] do_writepages+0x1e/0x50
> >>> [ 2536.250054] [<ffffffff811d7716>] __writeback_single_inode+0x76/0x240
> >>> [ 2536.250054] [<ffffffff811d7c12>] writeback_sb_inodes+0x282/0x420
> >>> [ 2536.250054] [<ffffffff811d7e2f>] __writeback_inodes_wb+0x7f/0xd0
> >>> [ 2536.250054] [<ffffffff811d884b>] wb_writeback+0x15b/0x2a0
> >>> [ 2536.250054] [<ffffffff811d8fa7>] bdi_writeback_workfn+0x1d7/0x450
> >>> [ 2536.250054] [<ffffffff8107be4d>] process_one_work+0x25d/0x460
> >>> [ 2536.250054] [<ffffffff8107d196>] worker_thread+0x266/0x480
> >>> [ 2536.250054] [<ffffffff8107cf30>] ? manage_workers.isra.18+0x3f0/0x3f0
> >>> [ 2536.250054] [<ffffffff81083bcb>] kthread+0xbb/0xd0
> >>> [ 2536.250054] [<ffffffff81083b10>] ? kthread_stop+0xf0/0xf0
> >>> [ 2536.250054] [<ffffffff814d4dbc>] ret_from_fork+0x7c/0xb0
> >>> [ 2536.250054] [<ffffffff81083b10>] ? kthread_stop+0xf0/0xf0
> >>> [ 2536.250054] Code: 00 74 70 48 63 35 d1 1f a1 00 ba ff ff ff ff eb 29 66 90 48 98 48 8b 0b 48 03 0c c5 00 27 ad 81 f6 41 20 01 74 14 0f 1f 44 00 00 <f3> 90 f6 41 20 01 75 f8 48 63 35 a1 1f a1 00 48 8b 7b 08 83 c2
> >>> [ 2544.900055] BUG: soft lockup - CPU#31 stuck for 24s! [systemd-journal:304]
> >>> [ 2544.900055] Modules linked in: isofs crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd raid0 md_mod acpi_cpufreq psmouse i2c_piix4 intel_agp intel_gtt i2c_core processor serio_raw evdev microcode ext4 crc16 mbcache jbd2 ata_generic pata_acpi ata_piix libata scsi_mod floppy ixgbevf xen_privcmd xen_netfront xen_kbdfront syscopyarea sysfillrect sysimgblt fb_sys_fops xen_blkfront virtio_pci virtio_net virtio_blk virtio_ring virtio ipmi_poweroff ipmi_msghandler button
> >>> [ 2544.900055] CPU: 31 PID: 304 Comm: systemd-journal Tainted: G W 3.12.0-rc4-bisect-00073-g6fe6b2d #26
> >>> [ 2544.900055] Hardware name: Xen HVM domU, BIOS 4.2.amazon 01/14/2014
> >>> [ 2544.900055] task: ffff880764bcb400 ti: ffff8807653f6000 task.ti: ffff8807653f6000
> >>> [ 2544.900055] RIP: 0010:[<ffffffff810cc040>] [<ffffffff810cc040>] generic_exec_single+0x80/0xa0
> >>> [ 2544.900055] RSP: 0018:ffff8807653f7c80 EFLAGS: 00000202
> >>> [ 2544.900055] RAX: 0000000000000080 RBX: ffffffff813207fd RCX: 0000000000000080
> >>> [ 2544.900055] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000080
> >>> [ 2544.900055] RBP: ffff8807653f7cb0 R08: ffffffff8160d148 R09: ffff880770c006a8
> >>> [ 2544.900055] R10: 0000000000000020 R11: ffffea003b490700 R12: ffffffff810b859e
> >>> [ 2544.900055] R13: ffff8807653f7bf8 R14: ffffffff810b50e7 R15: ffff8807653f7be8
> >>> [ 2544.900055] FS: 00007f0c934cd780(0000) GS:ffff880ef0fe0000(0000) knlGS:0000000000000000
> >>> [ 2544.900055] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [ 2544.900055] CR2: 00007f0c934db000 CR3: 0000000764b0f000 CR4: 00000000001406e0
> >>> [ 2544.900055] Stack:
> >>> [ 2544.900055] 0000000106038000 000000000000000f 000000000000001f ffffffff81adcfe0
> >>> [ 2544.900055] ffffffff810564e0 000000000000001f ffff8807653f7d20 ffffffff810cc195
> >>> [ 2544.900055] 00007f0c934db000 ffff8807653f7cd8 ffff880ef0ef74a8 ffff880ef0ef4d00
> >>> [ 2544.900055] Call Trace:
> >>> [ 2544.900055] [<ffffffff810564e0>] ? leave_mm+0x70/0x70
> >>> [ 2544.900055] [<ffffffff810cc195>] smp_call_function_single+0x135/0x170
> >>> [ 2544.900055] [<ffffffff810564e0>] ? leave_mm+0x70/0x70
> >>> [ 2544.900055] [<ffffffff810cc3c5>] smp_call_function_many+0x105/0x2b0
> >>> [ 2544.900055] [<ffffffff810564e0>] ? leave_mm+0x70/0x70
> >>> [ 2544.900055] [<ffffffff81056637>] native_flush_tlb_others+0x37/0x40
> >>> [ 2544.900055] [<ffffffff810568ae>] flush_tlb_mm_range+0x1fe/0x250
> >>> [ 2544.900055] [<ffffffff81167917>] tlb_flush_mmu+0x37/0xa0
> >>> [ 2544.900055] [<ffffffff81167994>] tlb_finish_mmu+0x14/0x50
> >>> [ 2544.900055] [<ffffffff8116fdc5>] unmap_region+0x105/0x120
> >>> [ 2544.900055] [<ffffffff811cd01e>] ? mntput_no_expire+0x3e/0x140
> >>> [ 2544.900055] [<ffffffff81170349>] ? vma_rb_erase+0x1c9/0x210
> >>> [ 2544.900055] [<ffffffff81171f10>] do_munmap+0x280/0x370
> >>> [ 2544.900055] [<ffffffff81172041>] vm_munmap+0x41/0x60
> >>> [ 2544.900055] [<ffffffff81172f42>] SyS_munmap+0x22/0x30
> >>> [ 2544.900055] [<ffffffff814d4e6d>] system_call_fastpath+0x1a/0x1f
> >>> [ 2544.900055] Code: 48 89 4b 08 48 89 19 e8 4f 05 40 00 4d 39 fc 8b 55 d4 75 0f 44 89 f7 ff 15 5e 8d 95 00 8b 55 d4 0f 1f 00 85 d2 75 06 eb 0a 66 90 <f3> 90 f6 43 20 01 75 f8 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f
> >>>
> >>> At this point, the MD array will not accept any I/O, and any requests will just
> >>> result in additional soft lockup messages.
> >>>
> >>> I originally noticed this issue on Linux 3.13.3, and wasn't able to reproduce
> >>> it on 3.10.30. I eventually narrowed it down to a regression introduced between
> >>> 3.12 and 3.13. A bisection blames this commit:
> >>>
> >>> commit b795854b1fa70f6aee923ae5df74ff7afeaddcaa
> >>> Author: Mel Gorman <mgorman@xxxxxxx>
> >>> Date: Mon Oct 7 11:29:07 2013 +0100
> >>>
> >>> sched/numa: Set preferred NUMA node based on number of private faults
> >>>
> >>> Ideally it would be possible to distinguish between NUMA hinting faults that
> >>> are private to a task and those that are shared. If treated identically
> >>> there is a risk that shared pages bounce between nodes depending on
> >>> the order they are referenced by tasks. Ultimately what is desirable is
> >>> that task private pages remain local to the task while shared pages are
> >>> interleaved between sharing tasks running on different nodes to give good
> >>> average performance. This is further complicated by THP as even
> >>> applications that partition their data may not be partitioning on a huge
> >>> page boundary.
> >>>
> >>> To start with, this patch assumes that multi-threaded or multi-process
> >>> applications partition their data and that in general the private accesses
> >>> are more important for cpu->memory locality in the general case. Also,
> >>> no new infrastructure is required to treat private pages properly but
> >>> interleaving for shared pages requires additional infrastructure.
> >>>
> >>> To detect private accesses the pid of the last accessing task is required
> >>> but the storage requirements are a high. This patch borrows heavily from
> >>> Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
> >>> to encode some bits from the last accessing task in the page flags as
> >>> well as the node information. Collisions will occur but it is better than
> >>> just depending on the node information. Node information is then used to
> >>> determine if a page needs to migrate. The PID information is used to detect
> >>> private/shared accesses. The preferred NUMA node is selected based on where
> >>> the maximum number of approximately private faults were measured. Shared
> >>> faults are not taken into consideration for a few reasons.
> >>>
> >>> First, if there are many tasks sharing the page then they'll all move
> >>> towards the same node. The node will be compute overloaded and then
> >>> scheduled away later only to bounce back again. Alternatively the shared
> >>> tasks would just bounce around nodes because the fault information is
> >>> effectively noise. Either way accounting for shared faults the same as
> >>> private faults can result in lower performance overall.
> >>>
> >>> The second reason is based on a hypothetical workload that has a small
> >>> number of very important, heavily accessed private pages but a large shared
> >>> array. The shared array would dominate the number of faults and be selected
> >>> as a preferred node even though it's the wrong decision.
> >>>
> >>> The third reason is that multiple threads in a process will race each
> >>> other to fault the shared page making the fault information unreliable.
> >>>
> >>> Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
> >>> [ Fix complication error when !NUMA_BALANCING. ]
> >>> Reviewed-by: Rik van Riel <riel@xxxxxxxxxx>
> >>> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> >>> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> >>> Cc: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
> >>> Signed-off-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >>> Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@xxxxxxx
> >>> Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
> >>>
> >>> Here's the bisection log:
> >>>
> >>> # bad: [d8ec26d7] Linux 3.13
> >>> # good: [5e01dc7b] Linux 3.12
> >>> # bad: [42a2d923] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
> >>> # bad: [4b4d2b46] Merge tag 'h8300-for-linus' of git://git.kernel.or
> >>> # skip: [c224b76b] Merge tag 'nfs-for-3.13-1' of git://git.linux-nfs.
> >>> # good: [ae1dd9bc] Staging: xillybus: quoted strings split across lin
> >>> # good: [beb5bfe4] Merge tag 'fixes-nc-for-linus' of git://git.kernel
> >>> # good: [f9efbce6] Merge tag 'dt-for-linus' of git://git.kernel.org/p
> >>> # good: [ad5d6989] Merge branch 'perf-core-for-linus' of git://git.ke
> >>> # bad: [39cf275a] Merge branch 'sched-core-for-linus' of git://git.k
> >>> # good: [e6628d5b] sched/numa: Reschedule task on preferred NUMA node
> >>> # bad: [04bb2f94] sched/numa: Adjust scan rate in task_numa_placemen
> >>> # bad: [e1dda8a7] sched/numa: Fix placement of workloads spread acro
> >>> # bad: [58d081b5] sched/numa: Avoid overloading CPUs on a preferred
> >>> # good: [073b5bee] sched/numa: Remove check that skips small VMAs
> >>> # bad: [6fe6b2d6] sched/numa: Do not migrate memory immediately afte
> >>> # bad: [b795854b] sched/numa: Set preferred NUMA node based on numbe
> >>>
> >>> Anyone have any ideas why this change broke things? Is there any additional
> >>> information I can provide to help pin this down?
> >>>
> >>> - Steven
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/