Hi Dongsu,
On Fri, Apr 17, 2015 at 5:41 AM, Dongsu Park
<dongsu.park@xxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> there's a critical bug regarding CPU hotplug, blk-mq, and scsi-mq.
> Every time a CPU is offlined, some arbitrary range of kernel memory
> seems to get corrupted. Then, after a while, the kernel panics at random
> places when block IOs are issued (for example, see the call traces below).

Thanks for the report.

> This bug is easily reproducible with a QEMU VM running virtio-scsi,
> when its guest kernel is 3.19-rc1 or higher, and when scsi-mq is loaded
> with blk-mq enabled. And yes, the 4.0 release is still affected, as well as
> Jens' for-4.1/core. How to reproduce:
>
>   # echo 0 > /sys/devices/system/cpu/cpu1/online
>   (and issue some block IOs, that's it.)
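
The reproduction steps above could be wrapped in a small loop, roughly as sketched below. This is only an illustration under stated assumptions, not the script used in the original report: the function names, the CPU_SYSFS variable, and the use of dd to generate block I/O are all hypothetical choices made for this sketch.

```shell
#!/bin/sh
# Hypothetical reproducer sketch. CPU_SYSFS is an assumed knob (not from the
# report) so the helpers can be pointed at a scratch directory for dry runs.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

# set_cpu_online CPU STATE
# Write 0 (offline) or 1 (online) to the CPU's "online" sysfs file.
set_cpu_online() {
    echo "$2" > "$CPU_SYSFS/cpu$1/online"
}

# issue_io FILE
# Write a few MiB with fsync, then read the file back, to generate block I/O.
# FILE should live on the virtio-scsi disk under test.
issue_io() {
    dd if=/dev/zero of="$1" bs=1M count=4 conv=fsync 2>/dev/null &&
    dd if="$1" of=/dev/null bs=1M 2>/dev/null
}

# hotplug_cycle CPU FILE ROUNDS
# Repeat ROUNDS times: offline the CPU, issue I/O, bring the CPU back online.
hotplug_cycle() {
    i=0
    while [ "$i" -lt "$3" ]; do
        set_cpu_online "$1" 0 || return 1
        issue_io "$2" || return 1
        set_cpu_online "$1" 1 || return 1
        i=$((i + 1))
    done
}
```

Run as root inside the guest, something like `hotplug_cycle 1 /path/on/scsi/disk/io.tmp 10` would cycle cpu1 ten times; the path is a placeholder. The read-back may be served from the page cache, so the fsync'ed write is the part that reliably reaches the block layer.
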

> Bisecting between 3.18 and 3.19-rc1, it looks like this bug had been hidden
> until commit ccbedf117f01 ("virtio_scsi: support multi hw queue of blk-mq"),
> which started to allow virtio-scsi to map virtqueues to hardware queues of
> blk-mq. Reverting that commit makes the bug go away. However, I suppose
> reverting it would not be a correct solution.

I agree; that patch only enables multiple hw queues, so it exposes the
problem rather than causing it.

> More precisely, every time a CPU hotplug event gets triggered,
> the call graph looks like the following:
>
> blk_mq_queue_reinit_notify()
>   -> blk_mq_queue_reinit()
>     -> blk_mq_map_swqueue()
>       -> blk_mq_free_rq_map()
>         -> scsi_exit_request()
>
> From that point, as soon as any address in the request gets modified, an
> arbitrary range of memory gets corrupted. My first guess was that
> the exit routine might try to deallocate tags->rqs[] where invalid
> addresses are stored. But actually it looks like that's not the case,
> and cmd->sense_buffer also looks valid.
> It's not obvious to me what exactly could go wrong.
> Does anyone have an idea?

As far as I can see, at least two problems exist:
- a race between timeout handling and CPU hotplug
- in the case of shared tags, how hctx->tags is set and checked
  during CPU online handling

So could you please test the two attached patches to see if they fix your issue?
I ran them in my VM, and it looks like the oops does disappear.
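
When testing the patches, one simple way to check whether the oops still shows up is to scan the kernel log after a few hotplug cycles. The helper below is a rough sketch only: the function name and the list of patterns are assumptions made for illustration, and a clean result does not prove that memory is no longer being corrupted.

```shell
#!/bin/sh
# log_has_oops
# Read kernel log text on stdin and succeed (exit 0) if any line looks like
# an oops or panic. The pattern list is an assumption, not exhaustive.
log_has_oops() {
    grep -E -q 'Oops|BUG:|Kernel panic|general protection fault'
}
```

For example, `dmesg | log_has_oops && echo "oops still present"` would print the message only if one of the patterns appears in the log.
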