Re: [bug report] scsi: libsas: Fix hung when disable phys

From: chenxiang (M)
Date: Wed Mar 08 2023 - 20:02:07 EST


Hi,


On 2023/2/27 21:17, yangxingui wrote:

Hi, All

If a remote PHY is disabled right after all local PHYs have been disabled
in an expander environment, as follows:
echo 0 > /sys/class/sas_phy/phy-4\:0/enable
echo 0 > /sys/class/sas_phy/phy-4\:1/enable
echo 0 > /sys/class/sas_phy/phy-4\:2/enable
echo 0 > /sys/class/sas_phy/phy-4\:3/enable
echo 0 > /sys/class/sas_phy/phy-4\:4/enable
echo 0 > /sys/class/sas_phy/phy-4\:5/enable
echo 0 > /sys/class/sas_phy/phy-4\:6/enable
echo 0 > /sys/class/sas_phy/phy-4\:7/enable
echo 0 > /sys/class/sas_phy/phy-4:0:7/enable

the following hang occurs:

[ 245.564088] INFO: task kworker/u256:1:883 blocked for more than 120 seconds.
[ 245.571115] Tainted: G O 5.16.0-rc4+ #1
[ 245.576759] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 245.584557] task:kworker/u256:1 state:D stack: 0 pid: 883 ppid: 2 flags:0x00000008
[ 245.592878] Workqueue: 0000:74:02.0_event_q sas_phy_event_worker [libsas]
[ 245.599652] Call trace:
[ 245.602092] __switch_to+0xd8/0x114
[ 245.605574] __schedule+0x2f0/0x85c
[ 245.609054] schedule+0x60/0x100
[ 245.612273] __kernfs_remove.part.0+0x288/0x2e0
[ 245.616791] kernfs_remove_by_name_ns+0x70/0xc0
[ 245.621307] sysfs_remove_file_ns+0x24/0x30
[ 245.625477] device_remove_file+0x24/0x34
[ 245.629475] attribute_container_remove_attrs+0x50/0x8c
[ 245.634684] attribute_container_class_device_del+0x24/0x3c
[ 245.640237] transport_remove_classdev+0x64/0x80
[ 245.644839] attribute_container_device_trigger+0x11c/0x124
[ 245.650393] transport_remove_device+0x24/0x30
[ 245.654823] sas_phy_delete+0x34/0x60
[ 245.658475] do_sas_phy_delete+0x60/0x70
[ 245.662385] device_for_each_child+0x68/0xb0
[ 245.666640] sas_remove_children+0x44/0x54
[ 245.670723] sas_destruct_devices+0x5c/0xa0 [libsas]
[ 245.675676] sas_deform_port+0x178/0x1bc [libsas]
[ 245.680371] sas_phye_loss_of_signal+0x28/0x34 [libsas]
[ 245.685583] sas_phy_event_worker+0x3c/0x60 [libsas]
[ 245.690536] process_one_work+0x1e0/0x46c
[ 245.694534] worker_thread+0x15c/0x464
[ 245.698272] kthread+0x188/0x194
[ 245.701491] ret_from_fork+0x10/0x20
[ 245.705120] INFO: task bash:25579 blocked for more than 120 seconds.
[ 245.711450] Tainted: G O 5.16.0-rc4+ #1
[ 245.717087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 245.724883] task:bash state:D stack: 0 pid:25579 ppid: 25113 flags:0x00000200
[ 245.733202] Call trace:
[ 245.735639] __switch_to+0xd8/0x114
[ 245.739117] __schedule+0x2f0/0x85c
[ 245.742595] schedule+0x60/0x100
[ 245.745814] schedule_timeout+0x180/0x1bc
[ 245.749811] wait_for_completion+0x8c/0x100
[ 245.753984] flush_workqueue+0x108/0x3d4
[ 245.757896] drain_workqueue+0xc8/0x16c
[ 245.761722] __sas_drain_work+0x54/0x90 [libsas]
[ 245.766328] sas_drain_work+0x68/0x70 [libsas]
[ 245.770760] queue_phy_enable+0x9c/0xec [libsas]
[ 245.775368] store_sas_phy_enable+0xf0/0x10c
[ 245.779624] dev_attr_store+0x24/0x40
[ 245.783275] sysfs_kf_write+0x50/0x60
[ 245.786930] kernfs_fop_write_iter+0x124/0x1b4
[ 245.791361] new_sync_write+0xf0/0x190
[ 245.795098] vfs_write+0x23c/0x2a0
[ 245.798490] ksys_write+0x78/0x104
[ 245.801882] __arm64_sys_write+0x28/0x3c
[ 245.805794] invoke_syscall.constprop.0+0x58/0xf0
[ 245.810483] do_el0_svc+0x19c/0x1b0
[ 245.813962] el0_svc+0x28/0xec
[ 245.817009] el0t_64_sync_handler+0x1a8/0x1ac
[ 245.821351] el0t_64_sync+0x1a0/0x1a4

We found that once all local PHYs are disabled, the PHY_LOSS_OF_SIGNAL
work removes all the devices. During removal it waits (in function
kernfs_drain) for kn->active of the device to be deactivated. However,
kn->active may still be active, because we are using the sysfs interface
to disable the remote PHY at the same time, and that sysfs handler in
turn drains the libsas work, including the PHY_LOSS_OF_SIGNAL work. Each
side waits for the other, so the hang occurs.

How should we fix the problem in this scenario?

It seems to be a common issue in the libsas layer.
What about calling the callback functions of phy_enable_work and
phy_reset_work directly in queue_phy_enable/queue_phy_reset, instead of
queueing those works and then calling sas_drain_work?
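
To make the wait cycle from the report explicit, here is a minimal
userspace sketch in C. It is only a model, not libsas code: the enum
names, struct wait_graph, build_graph(), has_wait_cycle() and
self_check() are all hypothetical. It assumes just the two waits
described in the report (the worker waiting for the sysfs writer's
active reference, and the writer waiting for the workqueue to drain),
so it says nothing about locking or event ordering in the real driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the two blocked tasks seen in the trace. */
enum task { TASK_EVENT_WORKER, TASK_SYSFS_WRITER, TASK_COUNT };

/* waits_on[t] is the task that t is blocked on, or -1 if t can run. */
struct wait_graph {
    int waits_on[TASK_COUNT];
};

/*
 * Build the wait-for graph for the reported scenario.
 * writer_drains == true models "queue the work, then drain the queue";
 * writer_drains == false models running the callback directly.
 */
struct wait_graph build_graph(bool writer_drains)
{
    struct wait_graph g;

    /* The worker removes the sysfs file and waits for the writer's
     * active reference on that file to be dropped. */
    g.waits_on[TASK_EVENT_WORKER] = TASK_SYSFS_WRITER;

    /* The writer waits for the worker only if it drains the queue. */
    g.waits_on[TASK_SYSFS_WRITER] = writer_drains ? TASK_EVENT_WORKER : -1;
    return g;
}

/* Follow wait edges from start; true if they lead back to start. */
bool has_wait_cycle(const struct wait_graph *g, enum task start)
{
    int cur = g->waits_on[start];

    for (int steps = 0; steps < TASK_COUNT && cur != -1; steps++) {
        if (cur == (int)start)
            return true;
        cur = g->waits_on[cur];
    }
    return false;
}

/* Small usage example: compare the current and the proposed behaviour. */
void self_check(void)
{
    struct wait_graph current_flow = build_graph(true);
    struct wait_graph direct_call = build_graph(false);

    assert(has_wait_cycle(&current_flow, TASK_EVENT_WORKER));
    assert(!has_wait_cycle(&direct_call, TASK_EVENT_WORKER));
}
```

In this model the cycle disappears once the sysfs writer no longer
waits on the event workqueue, which is the intent of the proposal above.
Whether the direct call is safe in the driver itself would still need
to be checked against the real code.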



regards,

Xingui
