Re: [PATCH v5 1/3] scsi: core: Add new helper to iterate all devices of host

From: Wenchao Hao
Date: Thu Jun 13 2024 - 03:10:28 EST


On 2024/6/13 14:27, Hannes Reinecke wrote:
On 6/12/24 17:06, Wenchao Hao wrote:
On 6/12/24 4:33 PM, Hannes Reinecke wrote:
On 6/5/24 11:17, Wenchao Hao wrote:
shost_for_each_device() would skip devices which is in SDEV_CANCEL or
SDEV_DEL state, for some scenarios, we donot want to skip these devices,
so add a new macro shost_for_each_device_include_deleted() to handle it.

Following changes are introduced:

1. Rework scsi_device_get(), add new helper __scsi_device_get() which
     determine if skip deleted scsi_device by parameter "skip_deleted".
2. Add new parameter "skip_deleted" to __scsi_iterate_devices() which
     is used when calling __scsi_device_get()
3. Update shost_for_each_device() to call __scsi_iterate_devices() with
     "skip_deleted" true
4. Add new macro shost_for_each_device_include_deleted() which call
     __scsi_iterate_devices() with "skip_deleted" false

Signed-off-by: Wenchao Hao <haowenchao22@xxxxxxxxx>
---
   drivers/scsi/scsi.c        | 46 ++++++++++++++++++++++++++------------
   include/scsi/scsi_device.h | 25 ++++++++++++++++++---
   2 files changed, 54 insertions(+), 17 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 3e0c0381277a..5913de543d93 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -735,20 +735,18 @@ int scsi_cdl_enable(struct scsi_device *sdev, bool enable)
       return 0;
   }
   -/**
- * scsi_device_get  -  get an additional reference to a scsi_device
+/*
+ * __scsi_device_get  -  get an additional reference to a scsi_device
    * @sdev:    device to get a reference to
- *
- * Description: Gets a reference to the scsi_device and increments the use count
- * of the underlying LLDD module.  You must hold host_lock of the
- * parent Scsi_Host or already have a reference when calling this.
- *
- * This will fail if a device is deleted or cancelled, or when the LLD module
- * is in the process of being unloaded.
+ * @skip_deleted: when true, would return failed if device is deleted
    */
-int scsi_device_get(struct scsi_device *sdev)
+static int __scsi_device_get(struct scsi_device *sdev, bool skip_deleted)
   {
-    if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL)
+    /*
+     * if skip_deleted is true and device is in removing, return failed
+     */
+    if (skip_deleted &&
+        (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL))
           goto fail;

Nack.
SDEV_DEL means the device is about to be deleted, so we _must not_ access it at all.


Sorry I added SDEV_DEL here at hand without understanding what it means.
Actually, just include scsi_device which is in SDEV_CANCEL would fix the
issues I described.

The issues are because device removing concurrent with error handle.
Normally, error handle would not be triggered when scsi_device is in
SDEV_DEL. Below is my analysis, if it is wrong, please correct me.

If there are scsi_cmnd remain unfinished when removing scsi_device,
the removing process would waiting for all commands to be finished.
If commands error happened and trigger error handle, the removing
process would be blocked until error handle finished, because
__scsi_remove_device called  del_gendisk() which would wait all
requests to be finished. So now scsi_device is in SDEV_CANCEL.

If the scsi_device is already in SDEV_DEL, then no scsi_cmnd has been
dispatched to this scsi_device, then error handle would never triggered.

I want to change the new function __scsi_device_get() as following,
please help to review.

/*
  * __scsi_device_get  -  get an additional reference to a scsi_device
  * @sdev:    device to get a reference to
  * @skip_canceled: when true, would return failed if device is deleted
  */
static int __scsi_device_get(struct scsi_device *sdev, bool skip_canceled)
{
    /*
     * if skip_canceled is true and device is in removing, return failed
     */
    if (sdev->sdev_state == SDEV_DEL ||
        (sdev->sdev_state == SDEV_CANCEL && skip_canceled))
        goto fail;
    if (!try_module_get(sdev->host->hostt->module))
        goto fail;
    if (!get_device(&sdev->sdev_gendev))
        goto fail_put_module;
    return 0;

fail_put_module:
    module_put(sdev->host->hostt->module);
fail:
    return -ENXIO;
}

I don't think that's required.
With your above analysis, wouldn't the problem be solved with:

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 775df00021e4..911fcfa80d69 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1470,6 +1470,8 @@ void __scsi_remove_device(struct scsi_device *sdev)
        if (sdev->sdev_state == SDEV_DEL)
                return;

+       scsi_block_when_processing_errors(sdev);
+
        if (sdev->is_visible) {
                /*
                 * If scsi_internal_target_block() is running concurrently,

Hmm?


We can not make sure no scsi_cmnd remain unfinished after scsi_block_when_processing_errors(). For example, there is a
command has been dispatched but it's not timeouted when removing
device, no error happened. After scsi_device is set to SDEV_CANCEL,
the removing process would be blocked by del_gendisk() because there
is still a request.

Then the request timeout and abort failed, error handle would be triggered, now scsi_device is SDEV_CANCEL.

The error handle would skip this scsi_device when doing device reset.

Cheers,

Hannes