Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

From: Hannes Reinecke
Date: Thu Oct 12 2023 - 10:50:57 EST


On 4/6/22 11:40, Wenchao Hao wrote:
On 2022/4/4 13:28, Hannes Reinecke wrote:
On 4/3/22 19:17, Mike Christie wrote:
On 4/3/22 12:14 PM, Mike Christie wrote:
We could share code with scsi_ioctl_reset as well. Drivers that support
TMFs via that ioctl already expect queuecommand to be possibly in the
middle of a run and IO not yet timed out. For example, the code to
block a queue and reset the device could be used for the new EH and
SG_SCSI_RESET_DEVICE handling.


Hannes or others,

How do parallel SCSI drivers support scsi_ioctl_reset? Is it not fully
supported, and mostly used only for controlled testing?

That's actually a problem in scsi_ioctl_reset(); it really should wait
for all I/O to quiesce. Currently it just sets the 'tmf' flag and calls
into the various reset functions.

But really, I'd rather get my EH rework in before we start discussing
modifying EH behaviour.
Let me repost it ...


Would you take fast EH (such as a single LUN reset) into consideration, maybe
as a second but lightweight EH? It means a lot.

Or provide a way for drivers to branch out of the general timeout and EH handling logic?

(Re-reading the thread:)

If it's just about device reset I guess we can implement an asynchronous version. Based on my EH rework we could / should do:

Have an 'eh_cmd_q' list per 'struct scsi_device' and 'struct
scsi_target'. So instead of always moving a failed command to the
'eh_cmd_q' list of the host, move it onto the list of the next higher
level (eg a failed abort would move it to the eh_cmd_q of 'struct
scsi_device', a failed device reset would move it to the eh_cmd_q of
'struct scsi_target' etc).
That would actually make the code in SCSI EH easier to read as we
could do away with constantly moving and splitting the per-host
eh_cmd_q list.
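
As a rough userspace sketch of that escalation (not kernel code: the
model_* types, the singly linked queue and the eh_*_failed() helpers are
all made up for illustration, and locking is ignored entirely):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these types are the kernel's. */
struct eh_cmd {
    int tag;
    struct eh_cmd *next;
};

struct eh_queue {
    struct eh_cmd *head;
    size_t len;
};

struct model_host   { struct eh_queue eh_cmd_q; };
struct model_target { struct eh_queue eh_cmd_q; struct model_host *host; };
struct model_device { struct eh_queue eh_cmd_q; struct model_target *target; };

static void eh_queue_add(struct eh_queue *q, struct eh_cmd *cmd)
{
    cmd->next = q->head;
    q->head = cmd;
    q->len++;
}

/* Move every command from 'from' to 'to', leaving 'from' empty. */
static void eh_queue_splice(struct eh_queue *from, struct eh_queue *to)
{
    while (from->head) {
        struct eh_cmd *cmd = from->head;

        from->head = cmd->next;
        from->len--;
        eh_queue_add(to, cmd);
    }
    assert(from->len == 0);
}

/* A failed abort parks the command on its own device. */
static void eh_abort_failed(struct model_device *sdev, struct eh_cmd *cmd)
{
    eh_queue_add(&sdev->eh_cmd_q, cmd);
}

/* A failed device reset escalates that device's commands to the target. */
static void eh_device_reset_failed(struct model_device *sdev)
{
    eh_queue_splice(&sdev->eh_cmd_q, &sdev->target->eh_cmd_q);
}

/* A failed target reset escalates the target's commands to the host. */
static void eh_target_reset_failed(struct model_target *starget)
{
    eh_queue_splice(&starget->eh_cmd_q, &starget->host->eh_cmd_q);
}
```

Note that a sibling device under the same target keeps its own list
untouched until its own reset fails, so nothing has to be split out of a
shared per-host list.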

And then, as a second step, implement a new eh callback for
asynchronous SCSI device aborts. That callback would need to
stop I/O to the device first, send the TMF, and either
restart the device upon successful completion or splice
the list of failed commands onto the target and call
the normal escalation, skipping eh_device_reset().
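
Very roughly, and again only as a userspace model, the flow of such a
callback could look like the following. Everything named sketch_* is
hypothetical, the TMF is assumed to be a LUN reset, counters stand in
for the eh_cmd_q lists, and what happens to the failed commands on
success (assumed here: they get requeued) is glossed over:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; names are illustrative, not kernel API. */
enum sketch_eh_result { SKETCH_EH_RESUMED, SKETCH_EH_ESCALATED };

struct sketch_target {
    int nr_failed;      /* stands in for the target's eh_cmd_q */
};

struct sketch_device {
    struct sketch_target *target;
    int nr_failed;      /* stands in for the device's eh_cmd_q */
    bool blocked;       /* true while new I/O is held off */
};

/* Driver-supplied hook: send the TMF, return true on success. */
typedef bool (*sketch_send_tmf_fn)(struct sketch_device *sdev);

/* Stub hooks, only here to exercise both outcomes. */
static bool sketch_tmf_ok(struct sketch_device *sdev)   { (void)sdev; return true; }
static bool sketch_tmf_fail(struct sketch_device *sdev) { (void)sdev; return false; }

static enum sketch_eh_result
sketch_async_device_reset(struct sketch_device *sdev, sketch_send_tmf_fn send_tmf)
{
    assert(sdev && sdev->target && send_tmf);

    /* 1. Stop I/O to this device only; other devices keep running. */
    sdev->blocked = true;

    /* 2. Send the TMF. */
    if (send_tmf(sdev)) {
        /* 3a. Success: assume the failed commands are requeued, then restart. */
        sdev->nr_failed = 0;
        sdev->blocked = false;
        return SKETCH_EH_RESUMED;
    }

    /*
     * 3b. Failure: splice the failed commands onto the target and leave
     * the device blocked. The caller continues the normal escalation at
     * the target level, i.e. without trying the device reset again.
     */
    sdev->target->nr_failed += sdev->nr_failed;
    sdev->nr_failed = 0;
    return SKETCH_EH_ESCALATED;
}
```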

Hmm?

Cheers,

Hannes