Re: [PATCH] scsi: Avoid potential infinite eh_timeout_handler() loop

From: James Bottomley
Date: Thu Jun 04 2015 - 16:27:32 EST


On Thu, 2015-06-04 at 11:40 -0700, Rajat Jain wrote:
> Each cmd timeout should result in scmd->retries++. Currently it happens
> just only before a command is requeued back. However, if the LLD
> eh_timed_out() handler asks to reset timer back again, then also it should
> be incremented because effectively LLD will be given a full time period
> (SD_TIMEOUT = 30 secs!) to attempt to complete the command.
>
> Why this is a problem:
>
> => Currently the SCSI low level transport drivers can provide
> eh_timeout_handler() calls (for e.g. iscsi provides this) to deal
> with command timeouts.
>
> => The eh_timeout_handler() can return BLK_EH_RESET_TIMER that causes the
> SCSI / block layer to reset the timer, thus giving more time to the
> LLD.
>
> => Currently a LLD can potentially loop infinitely on a command if it
> always keeps on returning BLK_EH_RESET_TIMER.
>
> * => Other than choking its own devices, if the command that is stuck is a
> command issued during sd_probe_async() (e.g. a partition table scan),
> then it impacts all the disks because no other disks can be removed
> from the system until sd_probe_async() returns. (sd_remove waits on
> async_synchronize_full_domain(...))
>
> => This problem actually resulted in the situation mentioned above,
> whereby no disks in the system (on other scsi hosts) could be removed,
> because of a stuck scsi command to read the partition tables of an
> unrelated problematic disk during probe. The threads were stuck at:
>
> schedule+0x312/0x7a0
> async_synchronize_cookie_domain+0xb8/0x115
> ? __wake_up_bit+0x40/0x40
> async_synchronize_full_domain+0x15/0x17
> sd_remove+0x5f/0x135
> __device_release_driver+0x8a/0xe0
> device_release_driver+0x23/0x30
> bus_remove_device+0x10f/0x123
> device_del+0x132/0x18e
> __scsi_remove_device+0x56/0xb6
> scsi_remove_device+0x26/0x33
> scsi_remove_target+0x12d/0x1a0
> ...
>
> What this patch does:
> => Ensure that any quests to reset the timer are accounted for, so that
> there is a finite upper bound on the time that a command is tried.
> Once allowed number of retries is reached, we proceed to standard
> error handling procedure (abort etc.) by scheduling the command
> for EH.
>
> Signed-off-by: Rajat Jain <rajatja@xxxxxxxxxx>

This is actually wrong. Originally the code you're suggesting did exist
and it used to cause us to time out far too early on some conditions.
Now scmd->retries is for specific things that shouldn't be retried too
often. Anything else appears to retry forever but in fact there's a
specific check (in the softirq and io_completion) to check that a
retryable failure hasn't taken longer than (cmd->allowed + 1) *
req->timeout.

This means effectively that nothing in SCSI is allowed to retry forever.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/