Re: [RFC PATCH v2 00/18] scsi: scsi_error: Introduce new error handle mechanism

From: Wenchao Hao
Date: Wed Sep 27 2023 - 05:39:46 EST


On 2023/9/27 1:37, Mike Christie wrote:
On 9/26/23 7:57 AM, Wenchao Hao wrote:
On 2023/9/26 1:54, Mike Christie wrote:
On 9/25/23 10:07 AM, Wenchao Hao wrote:
On 2023/9/25 22:55, Christoph Hellwig wrote:
Before we add another new error handling mechanism we need to fix the
old one first.  Hannes' work on not passing the scsi_cmnd to the various
reset handlers hasn't made a lot of progress in the last five years and
we'll need to urgently fix that first before adding even more
complexity.

I observed Hannes's patches posted about one year ago, it has not been
applied yet. I don't know if he is still working on it.

My patches do not depend much on that work, I think the conflict can be
solved fast between two changes.

I think we want to figure out Hannes's patches first.

For a new EH design we will want to be able to do multiple TMFs in parallel
on the same host/target right?


It's not necessary to do multiple TMFs in parallel, it's ok to make sure
each TMFs do not affect each other.

For example, we have two devices: 0:0:0:0 and 0:0:0:1

Both of them request device reset, they do not happened in parallel, but
would in serial. If 0:0:0:0 is performing device reset in progress, 0:0:0:1
just wait 0:0:0:0 to finish.

I see. I guess we still get the benefit of not having to stop other devices
when doing TMFs.


Yes, it's better to support multiple TMFs in parallel than just run in serial.
I would wait for Hannes's changes to be applied and send my change again.

I think we still want a common way to allocate/free and manage resources
drivers will use during this time. Maybe have a init_device/target/cmd/eh_priv
and exit_device/target/eh_priv (I'm not sure of the name, but something similar
to the init_cmd_priv/exit_cmd_priv we have for normal commands.

scsi-ml then calls into the new eh with the priv data. Drivers don't have to
do the preallocation and worry if it's per device/target/host.

I'm not 100% sure about the low level details. Check out how Hannes's is
handling tag management for TMFs as well.



The problem is that we need to be able to make forward progress in the EH
path and not fail just because we can't allocate memory for a TMF related
struct. To accomplish this now, drivers will use mempools, preallocate TMF
related structs/mem/tags with their scsi_cmnd related structs, preallocate
per host/target/device related structs or ignore what I wrote above and just
fail.

Hannes's patches fix up the eh callouts so they don't pass in a scsi_cmnd
when it's not needed. That seems nice because after that, then for your new
EH we can begin to standardize on how to handle preallocation of drivers
resources needed to perform TMFs for your new EH. It could be a per
device/target/host callout to allow drivers to preallocate, then scsi-ml calls
into the drivers with that data. It doesn't have to be exactly like that or
anything close. It would be nice for drivers to not have to think about this
type of thing and scsi-ml just to handle the resource management for us when
there are multiple TMFs in progress.