[PATCH 1/1] block: System crashes when cpu hotplug + bouncing port
From: wenxiong
Date: Mon Jun 28 2021 - 01:02:08 EST
From: Wen Xiong <wenxiong@xxxxxxxxxxxxxxxxxx>
Error injection:
1. Run: hash ppc64_cpu 2>/dev/null && ppc64_cpu --smt=4
2. Disable one SVC port (at the switch) for 10 minutes
3. Enable the port again
4. Linux crashes
The system has two cores with 16 CPUs, cpu0-cpu15. All CPUs are
online when the system boots:
core0: cpu0-cpu7 online
core1: cpu8-cpu15 online
The CPU hotplug command above (ppc64_cpu --smt=4) leaves the CPUs in
this state:
cpu0-cpu3 are online
cpu4-cpu7 are offline
cpu8-cpu11 are online
cpu12-cpu15 are offline
After these CPU hotplug operations, the state of the hctxs changes
(see the sketch below):
- cpu0-cpu3 (online): no change
- cpu4-cpu7 (offline): masked off. The state of each hctx is set to
INACTIVE, and the hctx is reallocated for these CPUs.
- cpu8-cpu11 (online): the CPUs are still active, but their hctxs are
disabled after the hctx reallocation.
- cpu12-cpu15 (offline): masked off. The state of each hctx is set to
INACTIVE, and the hctxs are disabled.
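For reference, a minimal sketch (my simplification, not the exact
upstream code) of how an hctx goes INACTIVE when its CPUs go offline;
loosely following the hotplug callback blk_mq_hctx_notify_offline()
in block/blk-mq.c:

#include <linux/blk-mq.h>
#include <linux/cpumask.h>

static int hctx_notify_offline_sketch(unsigned int cpu,
				      struct blk_mq_hw_ctx *hctx)
{
	/*
	 * Only act if @cpu belongs to this hctx (the real callback
	 * also checks that it is the last online CPU in
	 * hctx->cpumask, and later drains in-flight requests).
	 */
	if (!cpumask_test_cpu(cpu, hctx->cpumask))
		return 0;

	/* New tag allocations on this hctx now fail in blk_mq_get_tag(). */
	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	return 0;
}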
From the nvme/fc driver:
nvme_fc_create_association()
  -> nvme_fc_recreate_io_queues()       (if ctrl->ioq_live == true)
       -> blk_mq_update_nr_hw_queues()
       -> nvme_fc_connect_io_queues()
            -> nvmf_connect_io_queue()
static int
nvme_fc_connect_io_queues(struct nvme_fc_ctrl *ctrl, u16 qsize)
{
	int i, ret = 0;

	for (i = 1; i < ctrl->ctrl.queue_count; i++) {
		ret = nvmf_connect_io_queue(&ctrl->ctrl, i, false);
		if (ret)
			break;
		set_bit(NVME_FC_Q_LIVE, &ctrl->queues[i].flags);
	}
	return ret;
}
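For context, a sketch of how nvmf_connect_io_queue() reaches the
crashing allocation (my summary; the exact helper names vary by
kernel version). The Connect command for queue i goes through the
sync-command path, which allocates its request on a specific hctx:

nvmf_connect_io_queue(&ctrl->ctrl, i, false)
  -> __nvme_submit_sync_cmd(..., qid = i, ...)
       -> blk_mq_alloc_request_hctx(q, ..., hctx_idx = i - 1)

So queue i uses hctx id = i - 1.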
After the CPU hotplug, i loops from 1 to 8; let's see what happens
for each value of i:
i = 1, call blk_mq_alloc_request_hctx with id = 0 ok
i = 2, call blk_mq_alloc_request_hctx with id = 1 ok
i = 3, call blk_mq_alloc_request_hctx with id = 2 ok
i = 4, call blk_mq_alloc_request_hctx with id = 3 ok
i = 5, call blk_mq_alloc_request_hctx with id = 4 crash (cpu = 2048)
i = 6, call blk_mq_alloc_request_hctx with id = 5 crash (cpu = 2048)
i = 7, call blk_mq_alloc_request_hctx with id = 6 crash (cpu = 2048)
i = 8, call blk_mq_alloc_request_hctx with id = 7 crash (cpu = 2048)
All four crashes happen at the same line in blk_mq_alloc_request_hctx():
for hctx 4-7 every CPU in hctx->cpumask is offline, so the intersection
with cpu_online_mask is empty.
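The pre-patch code at the crash site, annotated (my comments):

	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
	/*
	 * cpumask_first_and() returns nr_cpu_ids (2048 on this config)
	 * when the two masks share no bit, i.e. when every CPU mapped
	 * to this hctx is offline.
	 */
	data.ctx = __blk_mq_get_ctx(q, cpu);
	/*
	 * __blk_mq_get_ctx() is per_cpu_ptr(q->queue_ctx, cpu); an
	 * out-of-range cpu yields a wild per-cpu pointer and the
	 * system crashes on the dereference.
	 */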
This patch fixes the crash seen when bouncing a port on the storage
side combined with CPU hotplug.
---
block/blk-mq-tag.c | 3 ++-
block/blk-mq.c | 4 +---
2 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 2a37731e8244..b927233bb6bb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -171,7 +171,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
* Give up this allocation if the hctx is inactive. The caller will
* retry on an active hctx.
*/
- if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
+ if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))
+ && data->hctx->queue_num > num_online_cpus()) {
blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
return BLK_MQ_NO_TAG;
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c86c01bfecdb..5e31bd9b06c2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -436,7 +436,6 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
.cmd_flags = op,
};
u64 alloc_time_ns = 0;
- unsigned int cpu;
unsigned int tag;
int ret;
@@ -468,8 +467,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
data.hctx = q->queue_hw_ctx[hctx_idx];
if (!blk_mq_hw_queue_mapped(data.hctx))
goto out_queue_exit;
- cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
- data.ctx = __blk_mq_get_ctx(q, cpu);
+ data.ctx = __blk_mq_get_ctx(q, hctx_idx);
if (!q->elevator)
blk_mq_tag_busy(data.hctx);
--
2.27.0