panic with CPU hotplug + blk-mq + scsi-mq

From: Dongsu Park
Date: Fri Apr 17 2015 - 05:42:11 EST


Hi,

there's a critical bug regarding CPU hotplug, blk-mq, and scsi-mq.
Every time when a CPU is offlined, some arbitrary range of kernel memory
seems to get corrupted. Then after a while, kernel panics at random places
when block IOs are issued. (for example, see the call traces below)

This bug can be easily reproducible with a Qemu VM running with virtio-scsi,
when its guest kernel is 3.19-rc1 or higher, and when scsi-mq is loaded
with blk-mq enabled. And yes, 4.0 release is still affected, as well as
Jens' for-4.1/core. How to reproduce:

# echo 0 > /sys/devices/system/cpu/cpu1/online
(and issue some block IOs, that's it.)

Bisecting between 3.18 and 3.19-rc1, it looks like this bug had been hidden
until commit ccbedf117f01 ("virtio_scsi: support multi hw queue of blk-mq"),
which started to allow virtio-scsi to map virtqueues to hardware queues of
blk-mq. Reverting that commit makes the bug go away. However, I suppose
reverting it could not be a correct solution.

More precisely, every time a CPU hotplug event gets triggered,
a call graph is like the following:

blk_mq_queue_reinit_notify()
-> blk_mq_queue_reinit()
-> blk_mq_map_swqueue()
-> blk_mq_free_rq_map()
-> scsi_exit_request()

>From that point, as soon as any address in the request gets modified, an
arbitrary range of memory gets corrupted. My first guess was that probably
the exit routine could try to deallocate tags->rqs[] where invalid
addresses are stored. But actually it looks like it's not the case,
and cmd->sense_buffer looks also valid.
It's not obvious to me, exactly what could go wrong.

Does anyone have an idea?

Regards,
Dongsu

---- [beginning of call traces] ----
[ 47.274292] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 47.275013] IP: [<ffffffff8140b31d>] __bt_get.isra.5+0x7d/0x1e0
[ 47.275013] PGD 79c55067 PUD 7ba17067 PMD 0
[ 47.275013] Oops: 0000 [#1] SMP
[ 47.275013] Modules linked in: fuse cpufreq_stats binfmt_misc 9p fscache dm_round_robin loop dm_multipath 9pnet_virtio rtc_cmos 9pnet acpi_cpufreq serio_raw i2c_piix4 virtio_net
[ 47.275013] CPU: 3 PID: 6232 Comm: blkid Not tainted 4.0.0 #303
[ 47.275013] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
[ 47.275013] task: ffff88003dfbc020 ti: ffff880079bac000 task.ti: ffff880079bac000
[ 47.275013] RIP: 0010:[<ffffffff8140b31d>] [<ffffffff8140b31d>] __bt_get.isra.5+0x7d/0x1e0
[ 47.275013] RSP: 0018:ffff880079baf898 EFLAGS: 00010246
[ 47.275013] RAX: 000000000000003c RBX: ffff880079198400 RCX: 0000000000000078
[ 47.275013] RDX: ffff88007fddbb80 RSI: 0000000000000010 RDI: ffff880079198400
[ 47.275013] RBP: ffff880079baf8e8 R08: ffff88007fddbb80 R09: 0000000000000000
[ 47.275013] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000010
[ 47.275013] R13: 0000000000000010 R14: ffff880079baf9e8 R15: ffff88007fddbb80
[ 47.275013] FS: 00002b270c049800(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 47.275013] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 47.275013] CR2: 0000000000000018 CR3: 000000007ca8d000 CR4: 00000000001407e0
[ 47.275013] Stack:
[ 47.275013] ffff880079baf978 ffff88007fdd58c0 0000000000000078 ffffffff814071ff
[ 47.275013] ffff880079baf8d8 ffff880079198400 0000000000000010 0000000000000010
[ 47.275013] ffff880079baf9e8 ffff88007fddbb80 ffff880079baf968 ffffffff8140b4e5
[ 47.275013] Call Trace:
[ 47.275013] [<ffffffff814071ff>] ? blk_mq_queue_enter+0x9f/0x2d0
[ 47.275013] [<ffffffff8140b4e5>] bt_get+0x65/0x1e0
[ 47.275013] [<ffffffff814071ff>] ? blk_mq_queue_enter+0x9f/0x2d0
[ 47.275013] [<ffffffff810c9b40>] ? wait_woken+0xa0/0xa0
[ 47.275013] [<ffffffff8140ba07>] blk_mq_get_tag+0xa7/0xd0
[ 47.275013] [<ffffffff8140630b>] __blk_mq_alloc_request+0x1b/0x200
[ 47.275013] [<ffffffff81408736>] blk_mq_map_request+0xd6/0x4e0
[ 47.275013] [<ffffffff8140a53e>] blk_mq_make_request+0x6e/0x2d0
[ 47.275013] [<ffffffff813fb844>] ? generic_make_request_checks+0x674/0x6a0
[ 47.275013] [<ffffffff813f23ae>] ? bio_add_page+0x5e/0x70
[ 47.275013] [<ffffffff813fb930>] generic_make_request+0xc0/0x110
[ 47.275013] [<ffffffff813fb9e8>] submit_bio+0x68/0x150
[ 47.275013] [<ffffffff811b0c6c>] ? lru_cache_add+0x1c/0x50
[ 47.275013] [<ffffffff8125972a>] mpage_bio_submit+0x2a/0x40
[ 47.275013] [<ffffffff8125a81c>] mpage_readpages+0x10c/0x130
[ 47.275013] [<ffffffff81254040>] ? I_BDEV+0x10/0x10
[ 47.275013] [<ffffffff81254040>] ? I_BDEV+0x10/0x10
[ 47.275013] [<ffffffff8119e417>] ? __page_cache_alloc+0x137/0x160
[ 47.275013] [<ffffffff8125486d>] blkdev_readpages+0x1d/0x20
[ 47.275013] [<ffffffff811ae43f>] __do_page_cache_readahead+0x29f/0x320
[ 47.275013] [<ffffffff811ae305>] ? __do_page_cache_readahead+0x165/0x320
[ 47.275013] [<ffffffff811aea14>] force_page_cache_readahead+0x34/0x60
[ 47.275013] [<ffffffff811aea86>] page_cache_sync_readahead+0x46/0x50
[ 47.275013] [<ffffffff811a094c>] generic_file_read_iter+0x52c/0x640
[ 47.275013] [<ffffffff81254c17>] blkdev_read_iter+0x37/0x40
[ 47.275013] [<ffffffff81211a0e>] new_sync_read+0x7e/0xb0
[ 47.275013] [<ffffffff81212c88>] __vfs_read+0x18/0x50
[ 47.275013] [<ffffffff81212d4d>] vfs_read+0x8d/0x150
[ 47.275013] [<ffffffff81212e59>] SyS_read+0x49/0xb0
[ 47.275013] [<ffffffff817063b2>] system_call_fastpath+0x12/0x17
[ 47.275013] Code: 97 18 03 00 00 bf 04 00 00 00 41 f7 f1 83 f8 04 0f 43 f8 b8 ff ff ff ff 44 39 d7 0f 86 c1 00 00 00 41 8b 00 48 89 4d c0 49 89 f5 <8b> 4e 08 8b 56 0c 4c 89 45 b0 c7 45 c8 00 00 00 00 41 89 c4 89
[ 47.275013] RIP [<ffffffff8140b31d>] __bt_get.isra.5+0x7d/0x1e0
[ 47.275013] RSP <ffff880079baf898>
[ 47.275013] CR2: 0000000000000018
[ 47.275013] ---[ end trace 9a650b674f0fae74 ]---
[ 47.701261] note: kworker/3:2[225] exited with preempt_count 1
[ 47.815398] BUG: unable to handle kernel paging request at ffffffffffffff98
[ 47.816324] IP: [<ffffffff810a00d0>] kthread_data+0x10/0x20
[ 47.816324] PGD 1c0c067 PUD 1c0e067 PMD 0
[ 47.816324] Oops: 0000 [#3] SMP
[ 47.816324] Modules linked in: fuse cpufreq_stats binfmt_misc 9p fscache dm_round_robin loop dm_multipath 9pnet_virtio rtc_cmos 9pnet acpi_cpufreq serio_raw i2c_piix4 virtio_net
[ 47.816324] CPU: 3 PID: 225 Comm: kworker/3:2 Tainted: G D W 4.0.0 #303
[ 47.816324] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
[ 47.816324] task: ffff88007ac90000 ti: ffff88007906c000 task.ti: ffff88007906c000
[ 47.816324] RIP: 0010:[<ffffffff810a00d0>] [<ffffffff810a00d0>] kthread_data+0x10/0x20
[ 47.816324] RSP: 0018:ffff88007906f5e8 EFLAGS: 00010092
[ 47.816324] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 000000000000000f
[ 47.816324] RDX: 000000000000000f RSI: 0000000000000003 RDI: ffff88007ac90000
[ 47.816324] RBP: ffff88007906f5e8 R08: ffff88007ac90090 R09: 0000000000000000
[ 47.816324] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88007fdd4dc0
[ 47.816324] R13: ffff88007ac90000 R14: 0000000000000003 R15: 0000000000000000
[ 47.816324] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 47.816324] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 47.816324] CR2: 0000000000000028 CR3: 0000000001c0b000 CR4: 00000000001407e0
[ 47.816324] Stack:
[ 47.816324] ffff88007906f608 ffffffff81099f35 ffff88007906f608 00000000001d4dc0
[ 47.816324] ffff88007906f678 ffffffff816ff757 ffffffff8107cfc6 0000000000000292
[ 47.816324] ffffffff8107cffa ffff88007ac90000 ffff88007906f678 0000000000000296
[ 47.816324] Call Trace:
[ 47.816324] [<ffffffff81099f35>] wq_worker_sleeping+0x15/0xa0
[ 47.816324] [<ffffffff816ff757>] __schedule+0xa77/0x1080
[ 47.816324] [<ffffffff8107cfc6>] ? do_exit+0x756/0xbf0
[ 47.816324] [<ffffffff8107cffa>] ? do_exit+0x78a/0xbf0
[ 47.816324] [<ffffffff816ffd97>] schedule+0x37/0x90
[ 47.816324] [<ffffffff8107d0d6>] do_exit+0x866/0xbf0
[ 47.816324] [<ffffffff810ec14e>] ? kmsg_dump+0xfe/0x200
[ 47.816324] [<ffffffff810068ad>] oops_end+0x8d/0xd0
[ 47.816324] [<ffffffff81047849>] no_context+0x119/0x370
[ 47.816324] [<ffffffff810ce795>] ? cpuacct_charge+0x5/0x1c0
[ 47.816324] [<ffffffff810b4a25>] ? sched_clock_local+0x25/0x90
[ 47.816324] [<ffffffff81047b25>] __bad_area_nosemaphore+0x85/0x210
[ 47.816324] [<ffffffff81047cc3>] bad_area_nosemaphore+0x13/0x20
[ 47.816324] [<ffffffff81047fb6>] __do_page_fault+0xb6/0x490
[ 47.816324] [<ffffffff8104839c>] do_page_fault+0xc/0x10
[ 47.816324] [<ffffffff817080c2>] page_fault+0x22/0x30
[ 47.816324] [<ffffffff8140b31d>] ? __bt_get.isra.5+0x7d/0x1e0
[ 47.816324] [<ffffffff8140b4e5>] bt_get+0x65/0x1e0
[ 47.816324] [<ffffffff810c9b40>] ? wait_woken+0xa0/0xa0
[ 47.816324] [<ffffffff8140ba07>] blk_mq_get_tag+0xa7/0xd0
[ 47.816324] [<ffffffff8140630b>] __blk_mq_alloc_request+0x1b/0x200
[ 47.816324] [<ffffffff81407f91>] blk_mq_alloc_request+0xa1/0x250
[ 47.816324] [<ffffffff813fc74c>] blk_get_request+0x2c/0xf0
[ 47.816324] [<ffffffff810a6acd>] ? __might_sleep+0x4d/0x90
[ 47.816324] [<ffffffff815747dd>] scsi_execute+0x3d/0x1f0
[ 47.816324] [<ffffffff815763be>] scsi_execute_req_flags+0x8e/0x100
[ 47.816324] [<ffffffff81576a43>] scsi_test_unit_ready+0x83/0x130
[ 47.816324] [<ffffffff8158672e>] sd_check_events+0x14e/0x1b0
[ 47.816324] [<ffffffff8140e731>] disk_check_events+0x51/0x170
[ 47.816324] [<ffffffff8140e86c>] disk_events_workfn+0x1c/0x20
[ 47.816324] [<ffffffff81099128>] process_one_work+0x1e8/0x800
[ 47.816324] [<ffffffff8109909d>] ? process_one_work+0x15d/0x800
[ 47.816324] [<ffffffff8109981a>] ? worker_thread+0xda/0x470
[ 47.816324] [<ffffffff81099793>] worker_thread+0x53/0x470
[ 47.816324] [<ffffffff81099740>] ? process_one_work+0x800/0x800
[ 47.816324] [<ffffffff81099740>] ? process_one_work+0x800/0x800
[ 47.816324] [<ffffffff8109f652>] kthread+0xf2/0x110
[ 47.816324] [<ffffffff810d3d4d>] ? trace_hardirqs_on+0xd/0x10
[ 47.816324] [<ffffffff8109f560>] ? kthread_create_on_node+0x230/0x230
[ 47.816324] [<ffffffff81706308>] ret_from_fork+0x58/0x90
[ 47.816324] [<ffffffff8109f560>] ? kthread_create_on_node+0x230/0x230
[ 47.816324] Code: 00 48 89 e5 5d 48 8b 40 88 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 20 04 00 00 55 48 89 e5 <48> 8b 40 98 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[ 47.816324] RIP [<ffffffff810a00d0>] kthread_data+0x10/0x20
[ 47.816324] RSP <ffff88007906f5e8>
[ 47.816324] CR2: ffffffffffffff98
[ 47.816324] ---[ end trace 9a650b674f0fae76 ]---
[ 47.816324] Fixing recursive fault but reboot is needed!
---- [end of call traces] ----
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/