Re: system hung up when offlining CPUs

From: Marc Zyngier
Date: Thu Aug 10 2017 - 07:54:43 EST


+ Christoph, since he's the one who came up with the idea

On 09/08/17 20:09, YASUAKI ISHIMATSU wrote:
> Hi Marc,
>
> On 08/09/2017 07:42 AM, Marc Zyngier wrote:
>> On Tue, 8 Aug 2017 15:25:35 -0400
>> YASUAKI ISHIMATSU <yasu.isimatu@xxxxxxxxx> wrote:
>>
>>> Hi Thomas,
>>>
>>> When offlining all CPUs except cpu0, system hung up with the following message.
>>>
>>> [...] INFO: task kworker/u384:1:1234 blocked for more than 120 seconds.
>>> [...] Not tainted 4.12.0-rc6+ #19
>>> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [...] kworker/u384:1 D 0 1234 2 0x00000000
>>> [...] Workqueue: writeback wb_workfn (flush-253:0)
>>> [...] Call Trace:
>>> [...] __schedule+0x28a/0x880
>>> [...] schedule+0x36/0x80
>>> [...] schedule_timeout+0x249/0x300
>>> [...] ? __schedule+0x292/0x880
>>> [...] __down_common+0xfc/0x132
>>> [...] ? _xfs_buf_find+0x2bb/0x510 [xfs]
>>> [...] __down+0x1d/0x1f
>>> [...] down+0x41/0x50
>>> [...] xfs_buf_lock+0x3c/0xf0 [xfs]
>>> [...] _xfs_buf_find+0x2bb/0x510 [xfs]
>>> [...] xfs_buf_get_map+0x2a/0x280 [xfs]
>>> [...] xfs_buf_read_map+0x2d/0x180 [xfs]
>>> [...] xfs_trans_read_buf_map+0xf5/0x310 [xfs]
>>> [...] xfs_btree_read_buf_block.constprop.35+0x78/0xc0 [xfs]
>>> [...] xfs_btree_lookup_get_block+0x88/0x160 [xfs]
>>> [...] xfs_btree_lookup+0xd0/0x3b0 [xfs]
>>> [...] ? xfs_allocbt_init_cursor+0x41/0xe0 [xfs]
>>> [...] xfs_alloc_ag_vextent_near+0xaf/0xaa0 [xfs]
>>> [...] xfs_alloc_ag_vextent+0x13c/0x150 [xfs]
>>> [...] xfs_alloc_vextent+0x425/0x590 [xfs]
>>> [...] xfs_bmap_btalloc+0x448/0x770 [xfs]
>>> [...] xfs_bmap_alloc+0xe/0x10 [xfs]
>>> [...] xfs_bmapi_write+0x61d/0xc10 [xfs]
>>> [...] ? kmem_zone_alloc+0x96/0x100 [xfs]
>>> [...] xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
>>> [...] xfs_map_blocks+0x1e8/0x260 [xfs]
>>> [...] xfs_do_writepage+0x1ca/0x680 [xfs]
>>> [...] write_cache_pages+0x26f/0x510
>>> [...] ? xfs_vm_set_page_dirty+0x1d0/0x1d0 [xfs]
>>> [...] ? blk_mq_dispatch_rq_list+0x305/0x410
>>> [...] ? deadline_remove_request+0x7d/0xc0
>>> [...] xfs_vm_writepages+0xb6/0xd0 [xfs]
>>> [...] do_writepages+0x1c/0x70
>>> [...] __writeback_single_inode+0x45/0x320
>>> [...] writeback_sb_inodes+0x280/0x570
>>> [...] __writeback_inodes_wb+0x8c/0xc0
>>> [...] wb_writeback+0x276/0x310
>>> [...] ? get_nr_dirty_inodes+0x4d/0x80
>>> [...] wb_workfn+0x2d4/0x3b0
>>> [...] process_one_work+0x149/0x360
>>> [...] worker_thread+0x4d/0x3c0
>>> [...] kthread+0x109/0x140
>>> [...] ? rescuer_thread+0x380/0x380
>>> [...] ? kthread_park+0x60/0x60
>>> [...] ret_from_fork+0x25/0x30
>>>
>>>
>>> I bisected upstream kernel. And I found that the following commit lead
>>> the issue.
>>>
>>> commit c5cb83bb337c25caae995d992d1cdf9b317f83de
>>> Author: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>> Date: Tue Jun 20 01:37:51 2017 +0200
>>>
>>> genirq/cpuhotplug: Handle managed IRQs on CPU hotplug
>>
>> Can you please post your /proc/interrupts and details of which
>> interrupt you think goes wrong? This backtrace is not telling us much
>> in terms of where to start looking...
>
> Thank you for giving advise.
>
> The issue is easily reproduced on physical/virtual machine by offling CPUs except cpu0.
> Here are my /proc/interrupts on kvm guest before reproducing the issue. And when offlining
> cpu1, the issue occurred. But when offling cpu0, the issue didn't occur.
>
> CPU0 CPU1
> 0: 127 0 IO-APIC 2-edge timer
> 1: 10 0 IO-APIC 1-edge i8042
> 4: 227 0 IO-APIC 4-edge ttyS0
> 6: 3 0 IO-APIC 6-edge floppy
> 8: 0 0 IO-APIC 8-edge rtc0
> 9: 0 0 IO-APIC 9-fasteoi acpi
> 10: 10822 0 IO-APIC 10-fasteoi ehci_hcd:usb1, uhci_hcd:usb2, virtio3
> 11: 23 0 IO-APIC 11-fasteoi uhci_hcd:usb3, uhci_hcd:usb4, qxl
> 12: 15 0 IO-APIC 12-edge i8042
> 14: 218 0 IO-APIC 14-edge ata_piix
> 15: 0 0 IO-APIC 15-edge ata_piix
> 24: 0 0 PCI-MSI 49152-edge virtio0-config
> 25: 359 0 PCI-MSI 49153-edge virtio0-input.0
> 26: 1 0 PCI-MSI 49154-edge virtio0-output.0
> 27: 0 0 PCI-MSI 114688-edge virtio2-config
> 28: 1 3639 PCI-MSI 114689-edge virtio2-req.0
> 29: 0 0 PCI-MSI 98304-edge virtio1-config
> 30: 4 0 PCI-MSI 98305-edge virtio1-virtqueues
> 31: 189 0 PCI-MSI 65536-edge snd_hda_intel:card0
> NMI: 0 0 Non-maskable interrupts
> LOC: 16115 12845 Local timer interrupts
> SPU: 0 0 Spurious interrupts
> PMI: 0 0 Performance monitoring interrupts
> IWI: 0 0 IRQ work interrupts
> RTR: 0 0 APIC ICR read retries
> RES: 3016 2135 Rescheduling interrupts
> CAL: 3666 557 Function call interrupts
> TLB: 65 12 TLB shootdowns
> TRM: 0 0 Thermal event interrupts
> THR: 0 0 Threshold APIC interrupts
> DFR: 0 0 Deferred Error APIC interrupts
> MCE: 0 0 Machine check exceptions
> MCP: 1 1 Machine check polls
> ERR: 0
> MIS: 0
> PIN: 0 0 Posted-interrupt notification event
> NPI: 0 0 Nested posted-interrupt event
> PIW: 0 0 Posted-interrupt wakeup event

I was able to reproduce this with an arm64 VM pretty easily. The issue
is that (in the above case), IRQ27 is affine to CPU0, and IRQ28 to CPU1.
If you have more CPUs, IRQ27 is affine to the first half of the CPUs,
and IRQ28 to the others.

When CPU1 is offlined, the fact that we have a "managed" interrupt
affinity prevents the interrupt from being moved to CPU0, and you loose
your disk. I don't think that's the expected effect... The opposite case
(offlining CPU0) only "works" because you're not getting any config
interrupt (IOW, you're lucky).

I'm not sure how this is supposed to work. Shutting down the interrupt
in migrate_one_irq() just breaks everything (unless the driver somehow
knows about it, which doesn't seem to be the case).

Christoph?

M.
--
Jazz is not dead. It just smells funny...