In Qian's kernel .config, async SCSI scan is disabled, so in the
failure case the SCSI scan is synchronous.
Below is the stack trace when scsi_scan_host() hangs:
[<0>] __wait_rcu_gp+0x134/0x170
[<0>] synchronize_rcu.part.80+0x53/0x60
[<0>] blk_free_flush_queue+0x12/0x30
[<0>] blk_mq_hw_sysfs_release+0x21/0x70
[<0>] kobject_release+0x46/0x150
[<0>] blk_mq_release+0xb4/0xf0
[<0>] blk_release_queue+0xc4/0x130
[<0>] kobject_release+0x46/0x150
[<0>] scsi_device_dev_release_usercontext+0x194/0x3f0
[<0>] execute_in_process_context+0x22/0xa0
[<0>] device_release+0x2e/0x80
[<0>] kobject_release+0x46/0x150
[<0>] scsi_alloc_sdev+0x2e7/0x310
[<0>] scsi_probe_and_add_lun+0x410/0xbd0
[<0>] __scsi_scan_target+0xf2/0x530
[<0>] scsi_scan_channel.part.7+0x51/0x70
[<0>] scsi_scan_host_selected+0xd4/0x140
[<0>] scsi_scan_host+0x198/0x1c0
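To check whether a hang on another setup follows the same call path, the
frames can be compared by symbol name. The following is only a minimal
sketch, assuming the trace is in the "[<0>] symbol+0xoff/0xsize" form shown
above; parse_stack() and same_call_path() are hypothetical helper names, not
part of any existing tool.

```python
import re

# One frame looks like: [<0>] blk_mq_release+0xb4/0xf0
FRAME_RE = re.compile(
    r"^\[<[0-9a-fA-Fx]+>\]\s+(\S+?)\+(0x[0-9a-fA-F]+)/(0x[0-9a-fA-F]+)"
)


def parse_stack(text):
    """Return the symbol names of all frames in a stack dump, in order."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append(m.group(1))
    return frames


def same_call_path(trace_a, trace_b):
    """Compare two dumps by symbol name only; offsets are ignored."""
    return parse_stack(trace_a) == parse_stack(trace_b)
```

Offsets are ignored on purpose because they differ between builds. Note that
compiler-generated suffixes such as ".part.80" may also differ between
builds, so a mismatch on those frames alone does not prove a different path.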
This issue hits when lock-related debugging is enabled in the kernel
config. The following kernel .config parameters (possibly only a subset
of this list) are required to hit the issue:
CONFIG_PREEMPT_COUNT=y *
CONFIG_UNINLINE_SPIN_UNLOCK=y *
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_RT_MUTEXES=y *
CONFIG_DEBUG_SPINLOCK=y *
CONFIG_DEBUG_MUTEXES=y *
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y *
CONFIG_DEBUG_RWSEMS=y *
CONFIG_DEBUG_LOCK_ALLOC=y *
CONFIG_LOCKDEP=y *
CONFIG_DEBUG_LOCKDEP=y
CONFIG_TRACE_IRQFLAGS=y *
CONFIG_TRACE_IRQFLAGS_NMI=y
CONFIG_DEBUG_KOBJECT=y
CONFIG_PROVE_RCU=y *
CONFIG_PREEMPTIRQ_TRACEPOINTS=y *
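To see quickly which of these options a given .config has enabled, something
like the following could be used. It is a rough sketch only: the option list
is copied from above, and enabled_options() and missing_options() are
hypothetical helper names.

```python
import re

# Options copied from the list above.
LOCK_DEBUG_OPTIONS = [
    "CONFIG_PREEMPT_COUNT",
    "CONFIG_UNINLINE_SPIN_UNLOCK",
    "CONFIG_LOCK_STAT",
    "CONFIG_DEBUG_RT_MUTEXES",
    "CONFIG_DEBUG_SPINLOCK",
    "CONFIG_DEBUG_MUTEXES",
    "CONFIG_DEBUG_WW_MUTEX_SLOWPATH",
    "CONFIG_DEBUG_RWSEMS",
    "CONFIG_DEBUG_LOCK_ALLOC",
    "CONFIG_LOCKDEP",
    "CONFIG_DEBUG_LOCKDEP",
    "CONFIG_TRACE_IRQFLAGS",
    "CONFIG_TRACE_IRQFLAGS_NMI",
    "CONFIG_DEBUG_KOBJECT",
    "CONFIG_PROVE_RCU",
    "CONFIG_PREEMPTIRQ_TRACEPOINTS",
]


def enabled_options(config_text):
    """Return the set of CONFIG_* symbols set to 'y' in a kernel .config."""
    enabled = set()
    for line in config_text.splitlines():
        m = re.match(r"^(CONFIG_\w+)=y\s*$", line.strip())
        if m:
            enabled.add(m.group(1))
    return enabled


def missing_options(config_text, wanted=LOCK_DEBUG_OPTIONS):
    """Return the options from 'wanted' that are not set to 'y'."""
    on = enabled_options(config_text)
    return [opt for opt in wanted if opt not in on]
```

For example, missing_options(open(".config").read()) returns the options
from the list that are not enabled in the .config in the current directory.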
When scsi_scan_host() hangs, there are no outstanding IOs in the
megaraid_sas driver-firmware stack: both the SCSI "host_busy" counter
and the megaraid_sas driver's internal counter are "0".
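The SCSI "host_busy" value can be read from sysfs while the scan is stuck.
The sketch below assumes the usual /sys/class/scsi_host/host<N>/host_busy
layout; read_host_busy() is a hypothetical helper name, and the megaraid_sas
internal counter is not covered here.

```python
import os


def read_host_busy(sysfs_root="/sys/class/scsi_host"):
    """Map each SCSI host name (e.g. 'host0') to its host_busy value."""
    counters = {}
    for host in sorted(os.listdir(sysfs_root)):
        path = os.path.join(sysfs_root, host, "host_busy")
        try:
            with open(path) as f:
                counters[host] = int(f.read().strip())
        except (OSError, ValueError):
            # Attribute missing or unreadable on this kernel: skip the host.
            continue
    return counters
```

A value of 0 for the megaraid_sas host during the hang would match the
observation above.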
Key takeaways:
1. The issue is observed when lock-related debugging is enabled, so it
is seen in a debug environment.
2. The issue seems to be related to the generic shared "host_tagset"
code and shows up whenever some kind of kernel debugging is enabled. We
do not see an immediate reason to hide it by disabling the
"host_tagset" feature.
John,
The issue may also hit on ARM platforms using Qian's .config file with
other adapters (e.g. hisi_sas). So I feel disabling "host_tagset" in
the megaraid_sas driver will not help. It requires debugging from the
perspective of the entire shared host tag feature, as the
scsi_scan_host() wait time gets worse when "host_tagset" is enabled.
I am also debugging in parallel and will share anything useful I find.
Qian,
I need full dmesg logs from your setup with
megaraid_sas.host_tagset_enable=1 and
megaraid_sas.host_tagset_enable=0. Please wait for a long time before
capturing the logs. I just want to make sure that whatever you observe
is the same as what I see.