Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assgined to irq vector
From: Laurence Oberman
Date: Mon Jan 15 2018 - 12:54:49 EST
On Mon, 2018-01-15 at 18:43 +0100, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Ming Lei wrote:
> > These two patches fix the IO hang issue reported by Laurence.
> >
> > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > may cause an irq vector to be assigned only to offline CPUs, and
> > then this vector can't handle irqs any more.
> >
> > The 1st patch moves the irq vector spreading into one function, and
> > prepares for the fix done in the 2nd patch.
> >
> > The 2nd patch fixes the issue by trying to make sure online CPUs are
> > assigned to each irq vector.
>
> Which means it's completely undoing the intent and mechanism of
> managed interrupts. Not going to happen.
>
> Which driver is that which abuses managed interrupts and does not
> keep its queues properly sorted on cpu hotplug?
>
> Thanks,
>
> tglx
Hello Thomas,

The servers I am using all boot from hpsa (SmartArray).
The system hangs on boot with the stack traces below.
This is seen when booting from the hpsa driver; Mike does not see it
when booting a server that does not use hpsa.
It is also not seen with the patch I called out reverted.
Putting that patch back into the combined Mike/Jens tree and adding
Ming's patch seems to fix the issue: I can boot.
I have only done a quick sanity boot and check, not any in-depth
testing yet.
This is not code I am at all familiar with, so I defer to Ming to
explain the changes in depth.

[  246.751050] INFO: task systemd-udevd:411 blocked for more than 120 seconds.
[  246.791852]       Tainted: G          I      4.15.0-rc4.block.dm.4.16+ #1
[  246.830650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  246.874637] systemd-udevd   D    0   411    408 0x80000004
[  246.904934] Call Trace:
[  246.918191]  ? __schedule+0x28d/0x870
[  246.937643]  ? _cond_resched+0x15/0x30
[  246.958222]  schedule+0x32/0x80
[  246.975424]  async_synchronize_cookie_domain+0x8b/0x140
[  247.004452]  ? remove_wait_queue+0x60/0x60
[  247.027335]  do_init_module+0xbe/0x219
[  247.048022]  load_module+0x21d6/0x2910
[  247.069436]  ? m_show+0x1c0/0x1c0
[  247.087999]  SYSC_finit_module+0x94/0xe0
[  247.110392]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[  247.136669] RIP: 0033:0x7f84049287f9
[  247.156112] RSP: 002b:00007ffd13199ab8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  247.196883] RAX: ffffffffffffffda RBX: 000055b712b59e80 RCX: 00007f84049287f9
[  247.237989] RDX: 0000000000000000 RSI: 00007f8405245099 RDI: 0000000000000008
[  247.279105] RBP: 00007f8404bf2760 R08: 0000000000000000 R09: 000055b712b45760
[  247.320005] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000020
[  247.360625] R13: 00007f8404bf2818 R14: 0000000000000050 R15: 00007f8404bf27b8
[  247.401062] INFO: task scsi_eh_0:471 blocked for more than 120 seconds.
[  247.438161]       Tainted: G          I      4.15.0-rc4.block.dm.4.16+ #1
[  247.476640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.520700] scsi_eh_0       D    0   471      2 0x80000000
[  247.551339] Call Trace:
[  247.564360]  ? __schedule+0x28d/0x870
[  247.584720]  schedule+0x32/0x80
[  247.601294]  hpsa_eh_device_reset_handler+0x68c/0x700 [hpsa]
[  247.633358]  ? remove_wait_queue+0x60/0x60
[  247.656345]  scsi_try_bus_device_reset+0x27/0x40
[  247.682424]  scsi_eh_ready_devs+0x53f/0xe20
[  247.706467]  ? __pm_runtime_resume+0x55/0x70
[  247.730327]  scsi_error_handler+0x434/0x5e0
[  247.754387]  ? __schedule+0x295/0x870
[  247.775420]  kthread+0xf5/0x130
[  247.793461]  ? scsi_eh_get_sense+0x240/0x240
[  247.818008]  ? kthread_associate_blkcg+0x90/0x90
[  247.844759]  ret_from_fork+0x1f/0x30
[  247.865440] INFO: task scsi_id:488 blocked for more than 120 seconds.
[  247.901112]       Tainted: G          I      4.15.0-rc4.block.dm.4.16+ #1
[  247.938743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  247.981092] scsi_id         D    0   488      1 0x00000004
[  248.010535] Call Trace:
[  248.023567]  ? __schedule+0x28d/0x870
[  248.044236]  ? __switch_to+0x1f5/0x460
[  248.065776]  schedule+0x32/0x80
[  248.084238]  schedule_timeout+0x1d4/0x2f0
[  248.106184]  wait_for_completion+0x123/0x190
[  248.130759]  ? wake_up_q+0x70/0x70
[  248.150295]  flush_work+0x119/0x1a0
[  248.169238]  ? wake_up_worker+0x30/0x30
[  248.189670]  __cancel_work_timer+0x103/0x190
[  248.213751]  ? kobj_lookup+0x10b/0x160
[  248.235441]  disk_block_events+0x6f/0x90
[  248.257820]  __blkdev_get+0x6a/0x480
[  248.278770]  ? bd_acquire+0xd0/0xd0
[  248.298438]  blkdev_get+0x1a5/0x300
[  248.316587]  ? bd_acquire+0xd0/0xd0
[  248.334814]  do_dentry_open+0x202/0x320
[  248.354372]  ? security_inode_permission+0x3c/0x50
[  248.378818]  path_openat+0x537/0x12c0
[  248.397386]  ? vm_insert_page+0x1e0/0x1f0
[  248.417664]  ? vvar_fault+0x75/0x140
[  248.435811]  do_filp_open+0x91/0x100
[  248.454061]  do_sys_open+0x126/0x210
[  248.472462]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[  248.495438] RIP: 0033:0x7f39e60e1e90
[  248.513136] RSP: 002b:00007ffc4c906ba8 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[  248.550726] RAX: ffffffffffffffda RBX: 00005624aead3010 RCX: 00007f39e60e1e90
[  248.586207] RDX: 00007f39e60cc0c4 RSI: 0000000000080800 RDI: 00007ffc4c906ed0
[  248.622411] RBP: 00007ffc4c906b60 R08: 00007f39e60cc140 R09: 00007f39e60cc140
[  248.658704] R10: 000000000000001f R11: 0000000000000246 R12: 00007ffc4c906ed0
[  248.695771] R13: 000000009da9d520 R14: 0000000000000000 R15: 00007ffc4c906c28
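
For anyone following along, the failure mode described in Ming's cover
letter (an irq vector whose assigned CPUs are all offline) can be
illustrated with a small user-space toy model. This is only a sketch under
stated assumptions, not the kernel's spreading code: NR_CPUS, the
contiguous-chunk spreading, and the helper names spread_possible_cpus()
and count_dead_vectors() are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8	/* hypothetical number of possible CPUs in this toy model */

/*
 * Toy model only, NOT the kernel implementation: spread all possible CPUs
 * over nvecs vectors in contiguous chunks, recording the vector that owns
 * each CPU in cpu_to_vec[].
 */
static void spread_possible_cpus(int cpu_to_vec[NR_CPUS], int nvecs)
{
	assert(nvecs > 0 && nvecs <= NR_CPUS);

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		cpu_to_vec[cpu] = cpu * nvecs / NR_CPUS;
}

/*
 * Count the vectors that have no online CPU among their assigned CPUs.
 * In this model such a vector can never have its interrupt serviced.
 */
static int count_dead_vectors(const int cpu_to_vec[NR_CPUS],
			      const bool online[NR_CPUS], int nvecs)
{
	int dead = 0;

	for (int vec = 0; vec < nvecs; vec++) {
		bool has_online = false;

		for (int cpu = 0; cpu < NR_CPUS; cpu++) {
			if (cpu_to_vec[cpu] == vec && online[cpu])
				has_online = true;
		}
		if (!has_online)
			dead++;
	}
	return dead;
}
```

In this model, with 8 possible CPUs, 4 vectors, and only CPUs 0-3 online,
spread_possible_cpus() followed by count_dead_vectors() reports 2 vectors
with no online CPU. Whether that matches what actually happens on the
hpsa systems above is for Ming to confirm.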