Re: [PATCH] libsas: flush pending destruct work in sas_unregister_domain_devices()

From: John Garry
Date: Thu Dec 07 2017 - 08:38:23 EST


On 28/11/2017 17:04, Cong Wang wrote:
On Tue, Nov 28, 2017 at 3:18 AM, John Garry <john.garry@xxxxxxxxxx> wrote:
On 28/11/2017 08:20, Johannes Thumshirn wrote:

On Mon, Nov 27, 2017 at 04:24:45PM -0800, Cong Wang wrote:

We saw dozens of the following kernel warning:

WARNING: CPU: 0 PID: 705 at fs/sysfs/group.c:224
sysfs_remove_group+0x54/0x88()
sysfs group ffffffff81ab7670 not found for kobject '6:0:3:0'
Modules linked in: cpufreq_ondemand x86_pkg_temp_thermal coretemp
kvm_intel kvm microcode raid0 iTCO_wdt iTCO_vendor_support sb_edac edac_core
lpc_ich mfd_core ioatdma i2c_i801 shpchp wmi hed acpi_cpufreq lp parport
tcp_diag inet_diag ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel igb ptp
pps_core i2c_algo_bit i2c_core crc32c_intel isci libsas scsi_transport_sas
dca ipv6
CPU: 0 PID: 705 Comm: kworker/u240:0 Not tainted 4.1.35.el7.x86_64 #1


This should by now be fixed with commit fbce4d97fd43 ("scsi: fixup kernel
warning during rmmod()"), which went into v4.14-rc6.


Is that the same issue? I think Cong Wang is just trying to deal with the
longstanding libsas hotplug WARN.

Right, we saw it on both 4.1 and 3.14, clearly an old bug.



We at Huawei are still working to fix it. Our patchset is under internal
test at the moment.

As for this patch:
drivers/scsi/libsas/sas_discover.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
index 60de66252fa2..27c11fc7aa2b 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -388,6 +388,11 @@ void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev)
 	}
 }

+static void sas_flush_work(struct asd_sas_port *port)
+{
+	scsi_flush_work(port->ha->core.shost);
+}
+
 void sas_unregister_domain_devices(struct asd_sas_port *port, int gone)
 {
 	struct domain_device *dev, *n;
@@ -401,8 +406,8 @@ void sas_unregister_domain_devices(struct asd_sas_port *port, int gone)
 	list_for_each_entry_safe(dev, n, &port->disco_list, disco_list_node)
 		sas_unregister_dev(port, dev);

+	sas_flush_work(port);

How can this work as sas_unregister_domain_devices() may be called from the
same workqueue which you're trying to flush?


Sorry for slow reply, just remembered this now.


I don't understand, the only caller of sas_unregister_domain_devices()
is sas_deform_port().


And sas_deform_port() may be called from another worker on the same queue, right? As in sas_phye_loss_of_signal() -> sas_deform_port().

As I see it today, this is the problem call chain:
sas_deform_port()
  sas_unregister_domain_devices()
    sas_unregister_dev()
      sas_discover_event(DISCE_DESTRUCT)

The device destruct runs in a separate work item from the one in which sas_deform_port() is called, but on the same queue. So the queued destruct happens only after the port is fully deformed, hence the WARN.
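
To illustrate the ordering, here is a minimal user-space sketch. It is not libsas code: the toy queue and the names queue_work_toy(), drain_queue(), deform_port_work(), destruct_work() and run_demo() are all made up for illustration, and it assumes a strictly ordered, single-threaded queue. It only shows that work queued from inside a running work item cannot run until that item has returned.

```c
#include <assert.h>
#include <stddef.h>

/* Toy single-threaded FIFO work queue (hypothetical, NOT the kernel API). */
#define MAX_WORK 8
#define MAX_LOG  8

typedef void (*work_fn)(void);

enum event { EV_DEFORM_START, EV_DEFORM_DONE, EV_DESTRUCT };

static work_fn queue[MAX_WORK];
static size_t q_head, q_tail;

static enum event event_log[MAX_LOG];
static size_t log_len;

static int port_alive;              /* 1 while the port is still formed */
static int destruct_saw_port_alive; /* what the destruct work observed */

static void log_event(enum event ev)
{
	assert(log_len < MAX_LOG);
	event_log[log_len++] = ev;
}

/* Append a work item to the tail of the queue. */
static void queue_work_toy(work_fn fn)
{
	assert(q_tail < MAX_WORK);
	queue[q_tail++] = fn;
}

/* Stand-in for the queued device destruct. */
static void destruct_work(void)
{
	destruct_saw_port_alive = port_alive;
	log_event(EV_DESTRUCT);
}

/* Stand-in for a work item that ends up deforming the port. */
static void deform_port_work(void)
{
	log_event(EV_DEFORM_START);
	/* Plays the role of sas_discover_event(DISCE_DESTRUCT):
	 * the destruct is only queued here, it does not run yet. */
	queue_work_toy(destruct_work);
	port_alive = 0;             /* port is now fully deformed */
	log_event(EV_DEFORM_DONE);
}

/* Run queued items one at a time, in FIFO order. */
static void drain_queue(void)
{
	while (q_head < q_tail)
		queue[q_head++]();
}

/* Returns what the destruct work saw: 1 = port alive, 0 = port gone. */
int run_demo(void)
{
	q_head = q_tail = log_len = 0;
	port_alive = 1;
	destruct_saw_port_alive = -1;

	queue_work_toy(deform_port_work);
	drain_queue();
	return destruct_saw_port_alive;
}
```

Calling run_demo() from a small main() returns 0 in this model, i.e. the destruct item runs only after the port is already gone. Flushing from inside deform_port_work() would not help in such a model either, since the flush would be waiting on a queue whose currently running item is the caller itself.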

I guess you only tested your patch on disks attached through an expander.

Thanks,
John







