BUG: scsi/qla2xxx: scsi_done from qla2x00_sp_compl race with scsi_queue_insert from abort handler

From: jianchao.wang
Date: Wed Nov 08 2017 - 01:29:32 EST


Hi

[1.] One line summary of the problem:
scsi_done from qla2x00_sp_compl race with scsi_queue_insert from scmd_eh_abort_handler()
then cause the BUG_ON(blk_queued_rq(req)) trigger.

[2.] Full description of the problem/report:
The detailed scene is as following:
cpu A cpu B
kworker interrupt

scmd_eh_abort_handler()
-> scsi_try_to_abort_cmd() qla2x00_status_entry()
-> qla2xxx_eh_abort() -> res = DID_RESET << 16 //case CS_ABORTED
-> qla2x00_eh_wait_on_command() -> qla2x00_sp_compl()
>>>> +-> cmd->result = res; // --->P0_B
while (CMD_SP(cmd) && wait_iter--) { +-> qla2x00_sp_free_dma()
msleep(ABORT_POLLING_PERIOD); | -> CMD_SP(cmd) = NULL
} |
>>>> |
-> scsi_noretry_cmd() //--->P2_A |
-> scsi_queue_insert() +-> scsi_done()
-> __scsi_queue_insert() -> blk_complete_request()
-> cmd->result = 0; // --->P0_A |
-> blk_requeue_request() |
-> blk_clear_rq_complete() // --->P1_A |
-> elv_requeue_request() |
-> __elv_add_request() +-> blk_mark_rq_complete() // ----> P1_B
// req will be queued here +-> __blk_complete_request()
BLK_SOFTIRQ
scsi_softirq_done()
-> scsi_decide_disposition() // ----> P2_B
// scsi_cmnd->result is set to 0 in P0_A
// so it will return SUCCESS
-> scsi_finish_command()
-> scsi_io_completion()
-> scsi_end_request()
-> blk_finish_request() // BUG_ON(blk_queued_rq(req)) !!!

The scsi_cmd->result was set to DID_RESET << 16 at P0_B.
At P2_A, the scsi_noretry_cmd() would return 0 and scsi_queue_insert() would be invoked.
When the blk_clear_rq_complete() at P1_A was invoked before the blk_mark_rq_complete()
at P1_B, __blk_complete_request() would be invoked in cpu B.
At P2_B, the scsi_decide_disposition() will return SUCCESS as the scsi_cmd->result had been set to 0 at P0_A.
Finally, the BUG_ON(blk_queued_rq(req)) in blk_finish_request() was triggered.

Whether the scsi_done should be invoked in qla2xxx_sp_compl() when scsi_cmd is aborted ?
When the P1_B is before P1_A, it will do nothing, otherwise, the panic comes up.

Thanks
Jianchao