Re: [PATCH] ibmvscsi: fix race potential race after loss of transport

From: Tyrel Datwyler
Date: Wed Oct 28 2020 - 21:28:45 EST


On 10/27/20 10:21 PM, Michael Ellerman wrote:
> Tyrel Datwyler <tyreld@xxxxxxxxxxxxx> writes:
>> After a loss of tranport due to an adatper migration or crash/disconnect from
>> the host partner there is a tiny window where we can race adjusting the
>> request_limit of the adapter. The request limit is atomically inc/dec to track
>> the number of inflight requests against the allowed limit of our VIOS partner.
>> After a transport loss we set the request_limit to zero to reflect this state.
>> However, there is a window where the adapter may attempt to queue a command
>> because the transport loss event hasn't been fully processed yet and
>> request_limit is still greater than zero. The hypercall to send the event will
>> fail and the error path will increment the request_limit as a result. If the
>> adapter processes the transport event prior to this increment the request_limit
>> becomes out of sync with the adapter state and can result in scsi commands being
>> submitted on the now reset connection prior to an SRP Login resulting in a
>> protocol violation.
>>
>> Fix this race by protecting request_limit with the host lock when changing the
>> value via atomic_set() to indicate no transport.
>>
>> Signed-off-by: Tyrel Datwyler <tyreld@xxxxxxxxxxxxx>
>> ---
>> drivers/scsi/ibmvscsi/ibmvscsi.c | 36 +++++++++++++++++++++++---------
>> 1 file changed, 26 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
>> index b1f3017b6547..188ed75417a5 100644
>> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
>> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
>> @@ -806,6 +806,22 @@ static void purge_requests(struct ibmvscsi_host_data *hostdata, int error_code)
>> spin_unlock_irqrestore(hostdata->host->host_lock, flags);
>> }
>>
>> +/**
>> + * ibmvscsi_set_request_limit - Set the adapter request_limit in response to
>> + * an adapter failure, reset, or SRP Login. Done under host lock to prevent
>> + * race with scsi command submission.
>> + * @hostdata: adapter to adjust
>> + * @limit: new request limit
>> + */
>> +static void ibmvscsi_set_request_limit(struct ibmvscsi_host_data *hostdata, int limit)
>> +{
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(hostdata->host->host_lock, flags);
>> + atomic_set(&hostdata->request_limit, limit);
>> + spin_unlock_irqrestore(hostdata->host->host_lock, flags);
>> +}
>> +
>> /**
>> * ibmvscsi_reset_host - Reset the connection to the server
>> * @hostdata: struct ibmvscsi_host_data to reset
> ...
>> @@ -2137,12 +2153,12 @@ static void ibmvscsi_do_work(struct ibmvscsi_host_data *hostdata)
>> }
>>
>> hostdata->action = IBMVSCSI_HOST_ACTION_NONE;
>> + spin_unlock_irqrestore(hostdata->host->host_lock, flags);
>
> You drop the lock ...
>
>> if (rc) {
>> - atomic_set(&hostdata->request_limit, -1);
>> + ibmvscsi_set_request_limit(hostdata, -1);
>
> .. then retake it, then drop it again in ibmvscsi_set_request_limit().
>
> Which introduces the possibility that something else gets the lock
> before you can set the limit to -1.
>
> I'm not sure that's a bug, but it's not obviously correct either?

Yeah, I'd already moved the request_limit update into its own function before I
got to this case which made me a bit uneasy when I realized I had to drop the
lock because my new function takes the lock. However, we only need to protect
ourselves from from racing with queuecommand() which is locked for its entire
call. Further, if we've gotten here it means we were either resetting or
re-enabling the adapter which would have already set request_limit to zero. At
this point the transport was already gone and we've further failed to reset it.
Also, we've blocked any new scsi requests at this point.

-Tyrel

>
> cheers
>
>> dev_err(hostdata->dev, "error after %s\n", action);
>> }
>> - spin_unlock_irqrestore(hostdata->host->host_lock, flags);
>>
>> scsi_unblock_requests(hostdata->host);
>> }