Re: kernel BUG at drivers/scsi/scsi_error.c:197! - git 4.17.0-x64-08428-g7d3bf613e99a

From: jianchao.wang
Date: Wed Jun 13 2018 - 00:04:07 EST


This is a multi-part message in MIME format.

On 06/12/2018 11:28 PM, Bart Van Assche wrote:
> On Sat, 2018-06-09 at 08:02 -0700, Randy Dunlap wrote:
>> On 06/09/2018 06:06 AM, Andrew Randrianasulu wrote:
>>> Hello, everyone!
>>>
>>> I was trying to test not-yet-completed 4.18-rc1, and run into this bug, probably
>>> due to HAL (yes, I still use this) trying to poll empty DVD-RW drive:
>>>
>>> [ 35.040167] ------------[ cut here ]------------
>>> [ 35.040171] kernel BUG at drivers/scsi/scsi_error.c:197!
>>> [ 35.040196] invalid opcode: 0000 [#1] SMP NOPTI
>>> [ 35.040210] CPU: 2 PID: 117 Comm: kworker/2:1H Not tainted
>>> 4.17.0-x64-08428-g7d3bf613e99a-dirty #18
>>> [ 35.040225] Hardware name: MSI MS-7693/970A-G46 (MS-7693), BIOS V2.8
>>> 01/08/2016
>>> [ 35.040244] Workqueue: kblockd blk_mq_timeout_work
>>> [ 35.040283] RIP: 0010:scsi_times_out+0xe9/0x1a6 [scsi_mod]
>>> [ 35.040287] Code: 03 76 18 48 c7 c2 f9 c2 01 a0 48 89 ee 48 c7 c7 14 be 01 a0
>>> 31 c0 e8 76 ad 00 00 48 8b 83 98 01 00 00 a8 01 0f 84 a5 00 00 00 <0f> 0b 49 8b
>>> 7c 24 38 e8 8f ba 4b e1 41 83 bc 24 2c 01
>>> 00 00 ff 74
>>> [ 35.040312] RSP: 0018:ffffc90000293d98 EFLAGS: 00010202
>>> [ 35.040318] RAX: ffff88031f240605 RBX: ffff88031eed8500 RCX: 00000000fffefd40
>>> [ 35.040323] RDX: ffffc90000293e80 RSI: 0000000000002007 RDI: ffff88031eed8638
>>> [ 35.040327] RBP: ffff88031eed8638 R08: ffff88031eed85d8 R09: 0000000000000000
>>> [ 35.040332] R10: 0000000000000040 R11: fefefefefefefeff R12: ffff88031f2a8800
>>> [ 35.040336] R13: ffff88031f2a8800 R14: ffff88031f29e910 R15: 0000000000000001
>>> [ 35.040341] FS: 0000000000000000(0000) GS:ffff88032ed00000(0000)
>>> knlGS:0000000000000000
>>> [ 35.040346] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 35.040350] CR2: 00000000f7ea83c0 CR3: 000000031e194000 CR4: 00000000000406e0
>>> [ 35.040354] Call Trace:
>>> [ 35.040360] blk_mq_check_expired+0xdc/0x10c
>>> [ 35.040365] bt_iter+0x42/0x45
>>> [ 35.040369] __sbitmap_for_each_set.constprop.12+0x83/0xac
>>> [ 35.040374] ? blk_mq_update_nr_requests+0xad/0xad
>>> [ 35.040595] ? blk_mq_exit_hctx+0xda/0xda
>>> [ 35.041595] blk_mq_queue_tag_busy_iter+0xa3/0xb4
>>> [ 35.042578] ? blk_mq_exit_hctx+0xda/0xda
>>> [ 35.043550] blk_mq_timeout_work+0x6b/0xca
>>> [ 35.044505] process_one_work+0x17c/0x2a6
>>> [ 35.045445] worker_thread+0x19f/0x243
>>> [ 35.046393] ? rescuer_thread+0x262/0x262
>>> [ 35.047346] kthread+0x100/0x108
>>> [ 35.048290] ? kthread_destroy_worker+0x3e/0x3e
>>> [ 35.049242] ret_from_fork+0x27/0x50
>>> [ 35.050192] Modules linked in: bridge stp llc bnep rfcomm hidp snd_aloop ipv6
.....
> Is this something that happens systematically or sporadically? Additionally,
> have you verified whether or not this can be triggered with kernel v4.17? I'm
> wondering whether or not this is a regression that was introduced in the merge
> window.
>

I suspect this is due to we could expire a same request twice or even more.
For scsi mid-layer, it return BLK_EH_DONE from .timeout, in fact, the request is not
completed there, but just queue a delayed abort_work (HZ/100). If the blk_mq_timeout_work
runs again before the abort_work, the request will be timed out again, because there is not
any mark on it to identify this request has been timed out.

Would please try the patch attached on to see whether this issue could be fixed ?
(this patch only works for scsi device currently)

Thanks
Jianchao