Re: [PATCH v2] scsi: libsas: Fix exp-attached end device cannot be scanned in again after probe failed

From: John Garry
Date: Fri May 24 2024 - 04:37:15 EST


On 24/04/2024 09:08, Xingui Yang wrote:
We found that it is judged as broadcast flutter when the exp-attached end
device reconnects after probe failed, as follows:

[78779.654026] sas: broadcast received: 0
[78779.654037] sas: REVALIDATING DOMAIN on port 0, pid:10
[78779.654680] sas: ex 500e004aaaaaaa1f phy05 change count has changed
[78779.662977] sas: ex 500e004aaaaaaa1f phy05 originated BROADCAST(CHANGE)
[78779.662986] sas: ex 500e004aaaaaaa1f phy05 new device attached
[78779.663079] sas: ex 500e004aaaaaaa1f phy05:U:8 attached: 500e004aaaaaaa05 (stp)
[78779.693542] hisi_sas_v3_hw 0000:b4:02.0: dev[16:5] found
[78779.701155] sas: done REVALIDATING DOMAIN on port 0, pid:10, res 0x0
[78779.707864] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
...
[78835.161307] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[78835.171344] sas: sas_probe_sata: for exp-attached device 500e004aaaaaaa05 returned -19
[78835.180879] hisi_sas_v3_hw 0000:b4:02.0: dev[16:5] is gone
[78835.187487] sas: broadcast received: 0
[78835.187504] sas: REVALIDATING DOMAIN on port 0, pid:10
[78835.188263] sas: ex 500e004aaaaaaa1f phy05 change count has changed
[78835.195870] sas: ex 500e004aaaaaaa1f phy05 originated BROADCAST(CHANGE)
[78835.195875] sas: ex 500e004aaaaaaa1f rediscovering phy05
[78835.196022] sas: ex 500e004aaaaaaa1f phy05:U:A attached: 500e004aaaaaaa05 (stp)
[78835.196026] sas: ex 500e004aaaaaaa1f phy05 broadcast flutter
[78835.197615] sas: done REVALIDATING DOMAIN on port 0, pid:10, res 0x0

The cause of the problem is that the related ex_phy's attached_sas_addr was
not cleared after the end device probe failed. In order to solve the above
problem, a function sas_ex_unregister_end_dev() is defined to clear the
ex_phy information and unregister the end device after the exp-attached end
device probe failed.

As the sata device is an asynchronous probe, the sata device may probe
failed after done REVALIDATING DOMAIN. Then after its port is added to the
sas_port_del_list, the port will not be deleted until the end of the next
REVALIDATING DOMAIN and sas_destruct_ports() is called. A warning about
creating a duplicate port will occur in the new REVALIDATING DOMAIN when
the end device reconnects. Therefore, the previous destroy_list and
sas_port_del_list should be handled before REVALIDATING DOMAIN.

Signed-off-by: Xingui Yang <yangxingui@xxxxxxxxxx>
---
Changes since v1:
- Simplify the process of getting ex_phy id based on Jason's suggestion.
- Update commit information.
---
drivers/scsi/libsas/sas_discover.c | 2 ++
drivers/scsi/libsas/sas_expander.c | 8 ++++++++
drivers/scsi/libsas/sas_internal.h | 6 +++++-
3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
index 8fb7c41c0962..aae90153f4c6 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -517,6 +517,8 @@ static void sas_revalidate_domain(struct work_struct *work)
struct sas_ha_struct *ha = port->ha;
struct domain_device *ddev = port->port_dev;
+ sas_destruct_devices(port);
+ sas_destruct_ports(port);

We still have both these same calls at the @out label - is that as desired?

Why do these new additions not cover the same job which those calls to the same functions @out covers?

/* prevent revalidation from finding sata links in recovery */
mutex_lock(&ha->disco_mutex);
if (test_bit(SAS_HA_ATA_EH_ACTIVE, &ha->state)) {
diff --git a/drivers/scsi/libsas/sas_expander.c b/drivers/scsi/libsas/sas_expander.c
index f6e6db8b8aba..45793c10009b 100644
--- a/drivers/scsi/libsas/sas_expander.c
+++ b/drivers/scsi/libsas/sas_expander.c
@@ -1856,6 +1856,14 @@ static void sas_unregister_devs_sas_addr(struct domain_device *parent,
}
}
+void sas_ex_unregister_end_dev(struct domain_device *dev)
+{
+ struct domain_device *parent = dev->parent;
+ struct sas_phy *phy = dev->phy;
+
+ sas_unregister_devs_sas_addr(parent, phy->number, true);
+}
+
static int sas_discover_bfs_by_root_level(struct domain_device *root,
const int level)
{
diff --git a/drivers/scsi/libsas/sas_internal.h b/drivers/scsi/libsas/sas_internal.h
index 3804aef165ad..434f928c2ed8 100644
--- a/drivers/scsi/libsas/sas_internal.h
+++ b/drivers/scsi/libsas/sas_internal.h
@@ -50,6 +50,7 @@ void sas_discover_event(struct asd_sas_port *port, enum discover_event ev);
void sas_init_dev(struct domain_device *dev);
void sas_unregister_dev(struct asd_sas_port *port, struct domain_device *dev);
+void sas_ex_unregister_end_dev(struct domain_device *dev);
void sas_scsi_recover_host(struct Scsi_Host *shost);
@@ -145,7 +146,10 @@ static inline void sas_fail_probe(struct domain_device *dev, const char *func, i
func, dev->parent ? "exp-attached" :
"direct-attached",
SAS_ADDR(dev->sas_addr), err);
- sas_unregister_dev(dev->port, dev);
+ if (dev->parent && !dev_is_expander(dev->dev_type))

This check looks odd.

So we're checking if we have a parent device and we are not an expander, right?

+ sas_ex_unregister_end_dev(dev);
+ else
+ sas_unregister_dev(dev->port, dev);
}
static inline void sas_fill_in_rphy(struct domain_device *dev,