Re: [PATCH 1/1] block: System crashes when cpu hotplug + bouncing port

From: Ming Lei
Date: Tue Jun 29 2021 - 06:07:42 EST


On Tue, Jun 29, 2021 at 11:49:38AM +0200, Daniel Wagner wrote:
> On Tue, Jun 29, 2021 at 05:35:51PM +0800, Ming Lei wrote:
> > With the two patches I posted, __nvme_submit_sync_cmd() shouldn't return
> > error, can you observe the error?
>
> There are still ways the allocation can fail:
>
> ret = blk_queue_enter(q, flags);
> if (ret)
> return ERR_PTR(ret);
>
> ret = -EXDEV;
> data.hctx = q->queue_hw_ctx[hctx_idx];
> if (!blk_mq_hw_queue_mapped(data.hctx))
> goto out_queue_exit;

The above failure is supposed to be handled as error, either queue is
frozen or hctx is unmapped.

>
> No, I don't see any errors. I am still trying to reproduce it on real
> hardware. The setup with blktests running in Qemu did work with all
> patches applied (the once from me and your patches).
>
> About the error argument: Later in the code path, e.g. in
> __nvme_submit_sync_cmd() transport errors (incl. canceled request) are
> handled as well, hence the upper layer will see errors during connection
> attempts. My point is, there is nothing special about the connection
> attempt failing. We have error handling code in place and the above
> state machine has to deal with it.

My two patches not only avoids the kernel panic, but also allow
request to be allocated successfully, then connect io queue request can
be submitted to driver even though all CPUs in hctx->cpumask is offline,
then nvmef can be setup well.

That is the difference with yours to fail the request allocation, then
connect io queues can't be done, and the whole host can't be setup
successfully, then become a brick. The point is that cpu offline shouldn't
fail to setup nvme fc/rdma/tcp/loop.

>
> Anyway, avoiding the if in the hotpath is a good thing. I just don't
> think your argument about no error can happen is correct.

Again, it isn't related with avoiding the if, and it isn't in hotpath
at all.

Thanks,
Ming