Re: [RFC PATCH] nvme-tcp: fix possible crash when only partilal queues are allocated by nvme_tcp_alloc_queue

From: zhangyanjun
Date: Wed Mar 15 2023 - 05:42:21 EST


From: Yanjun Zhang <zhangyanjun@xxxxxxxx>

Thanks for your applying and suggestions!

> Hi Yanjun,
>
>> In storage clusters constructed by nvme-tcp driver, we have encountered
>> the following two kinds of crashes on the host kernel repeatedly.
>>
>> one crash info:
>> [16802.107284] nvme nvme4: starting error recovery
>> [16802.109166] nvme nvme4: Reconnecting in 10 seconds...
>> [16812.173535] nvme nvme4: failed to connect socket: -111
>> [16812.173745] nvme nvme4: Failed reconnect attempt 1
>> [16812.173747] nvme nvme4: Reconnecting in 10 seconds...
>> [16822.413555] nvme nvme4: failed to connect socket: -111
>> [16822.413762] nvme nvme4: Failed reconnect attempt 2
>> [16822.413765] nvme nvme4: Reconnecting in 10 seconds...
>> [16832.661274] nvme nvme4: creating 32 I/O queues.
>> [16833.919887] BUG: kernel NULL pointer dereference, address: 0000000000000088
>> [16833.920068] nvme nvme4: Failed reconnect attempt 3
>> [16833.920094] #PF: supervisor write access in kernel mode
>> [16833.920261] nvme nvme4: Reconnecting in 10 seconds...
>> [16833.920368] #PF: error_code(0x0002) - not-present page
>> [16833.920600] PGD a21a12067 P4D b998cf067 PUD c00be5067 PMD 0
>> [16833.920731] Oops: 0002 [#1] SMP NOPTI
>> [16833.920852] CPU: 28 PID: 1324 Comm: kworker/28:1H Kdump: loaded Tainted: G S 5.15.67.cl9.x86_64 #1
>> [16833.920980] Hardware name: Inspur NF5280M6/NF5280M6, BIOS 06.00.04 07/07/2022
>> [16833.921086] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
>> [16833.921191] RIP: 0010:_raw_spin_lock_bh+0x17/0x30
>> [16833.921294] Code: ff 66 90 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 65 81 05 f0 ab 35 4a 00 02 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 05 c3 cc cc cc cc 89 c6 e8 c7 09 69 ff 66 90 c3 cc
>> [16833.921539] RSP: 0018:ff5d9770a078fdf8 EFLAGS: 00010246
>> [16833.921658] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>> [16833.921787] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000088
>> [16833.921912] RBP: 0000000000000088 R08: ff4f755fca433f00 R09: ff4f7558374a7874
>> [16833.922040] R10: 000000000000000c R11: 000000000000000c R12: ff4f7563dbbdad80
>> [16833.922177] R13: 0000000100fc48f3 R14: ff4f754b6dbfa278 R15: ff8f97707db0b905
>> [16833.922317] FS: 0000000000000000(0000) GS:ff4f756fccd00000(0000) knlGS:0000000000000000
>> [16833.922454] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [16833.922593] CR2: 0000000000000088 CR3: 0000000b8477e002 CR4: 0000000000771ee0
>> [16833.922731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [16833.922869] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [16833.923007] PKRU: 55555554
>> [16833.923138] Call Trace:
>> [16833.923271] <TASK>
>> [16833.923402] lock_sock_nested+0x1e/0x50
>> [16833.923545] nvme_tcp_try_recv+0x40/0xa0 [nvme_tcp]
>> [16833.923685] nvme_tcp_io_work+0x68/0xa0 [nvme_tcp]
>> [16833.923824] process_one_work+0x1e8/0x390
>> [16833.923969] worker_thread+0x53/0x3d0
>> [16833.924104] ? process_one_work+0x390/0x390
>> [16833.924240] kthread+0x124/0x150
>> [16833.924376] ? set_kthread_struct+0x50/0x50
>> [16833.924518] ret_from_fork+0x1f/0x30
>> [16833.924655] </TASK>
>>
>> By analyzing the vmcore, we confirm the direct cause is sock->sk=NULL
>> and sock->ops=NULL attached to the struct nvme_tcp_queue. And we found
>> this issue occurred during the execuation of __nvme_tcp_alloc_io_queues
>> every time. Such as when nvme_tcp_alloc_queue failed to allocate the
>> 29th I/O queue, the return flow would free previously allocated 28 I/O
>> queues, but possibly some io_works attached to these allocated I/O queues
>> have already been queued.
>
> That should not be possible, io_work is only triggered after a queue is
> started. If io_work is running before the queue is started, we have
> a wider issue. Did you confirm that the io_work belongs to an io queue
> and not the admin queue?

Yes, we think that the io_work belongs to an io_queue by comparing the
nvme_tcp_queue pointer and queues[] attached to nvme_tcp_ctrl, its qid is
not zero. The size of struct nvme_tcp_queue in our kernel is 0x130.
crash> nvme_tcp_queue.sock ff4f754b6dbfa270
sock = 0xff4f7563dbbdad80,
crash> socket.sk 0xff4f7563dbbdad80
sk = 0x0,
crash> socket.ops 0xff4f7563dbbdad80
ops = 0x0,
crash> nvme_tcp_queue.ctrl ff4f754b6dbfa270
ctrl = 0xff4f75608c9a6000,
crash> nvme_tcp_ctrl.queues 0xff4f75608c9a6000
queues = 0xff4f754b6dbf8000,

Obviously, the ff4f754b6dbfa270 which analysized from stack is not
the admin queue pointer. And we found that sock attached to the
29th I/O queue is NULL, socks attached to other queues are all
normal kernel addresses. We think it should be dealt with err_sock
in nvme_tcp_alloc_queue due to some errors.
err_sock:
sock_release(queue->sock);
queue->sock = NULL;
crash> nvme_tcp_queue.sock ff4f754b6dbfa3a0
sock = 0x0,
crash> nvme_tcp_queue.sock ff4f754b6dbfa4d0
sock = 0xff4f75641f4aa700,

>
>> So we query that directly freeing these allocated
>> I/O queues may result in this issue. To avoid this issuse, we try to make
>> the following code modifications, but have not verified to be valid yet
>> because of the occasionality of this issue.
>>
>> Now, we can not confirm the caller of queuing on the io_work during
>> allocating all the I/O queues, but we doubt that the socket interface
>> sk_data_ready=nvme_tcp_data_ready initialized by nvme_tcp_alloc_queue
>> may do this action in this interval. Hope to get some suggestions, Thanks!
>
> I would start by determining that the running io_work is an io queue or
> the admin queue. It is also possible that the admin queue cleanup after
> we io queues allocation failed is the problematic teardown that is
> causing this bug.
>
>>
>> Signed-off-by: Yanjun Zhang <zhangyanjun@xxxxxxxx>
>> ---
>> drivers/nvme/host/tcp.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 9b47dcb2a..5782183b8 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1761,6 +1761,7 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
>> static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
>> {
>> int i, ret;
>> + struct nvme_tcp_ctrl *tcp_ctrl = NULL;
>>
>> for (i = 1; i < ctrl->queue_count; i++) {
>> ret = nvme_tcp_alloc_queue(ctrl, i);
>> @@ -1771,9 +1772,11 @@ static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
>> return 0;
>>
>> out_free_queues:
>> - for (i--; i >= 1; i--)
>> + for (i--; i >= 1; i--) {
>> + tcp_ctrl = to_tcp_ctrl(ctrl);
>> + __nvme_tcp_stop_queue(&tcp_ctrl->queues[i]);
>> nvme_tcp_free_queue(ctrl, i);
>> -
>> + }
>> return ret;
>> }
>>
>
> The crash dump reminds me of a reported issue fixed by:
> 160f3549a907 ("nvme-tcp: fix UAF when detecting digest errors")
>
> Is this applied in your host kernel?

I checked the above issue, and the kernel we used have applied this fix.
And I have add the following trace to our reproduction attempts.

>
> I'd also add this trace to your reproduction attempts:
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0ec7d3329595..0cbc8afb41c8 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -919,6 +919,10 @@ static void nvme_tcp_data_ready(struct sock *sk)
>
> read_lock_bh(&sk->sk_callback_lock);
> queue = sk->sk_user_data;
> + if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags)) {
> + pr_err("got data on a deallocated queue %d\n",
> nvme_tcp_queue_id(queue));
> + WARN_ON_ONCE(1);
> + }
> if (likely(queue && queue->rd_enabled) &&
> !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> --
>

> If indeed this is the io_queue and for some reason it is running,
> perhaps the correct fix would be to set the socket ops later in
> start_queue. It will also be symmetric because we clear them in
> stop_queue:
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0ec7d3329595..b3b1e39966c0 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1617,22 +1617,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl
> *nctrl, int qid)
> if (ret)
> goto err_init_connect;
>
> - queue->rd_enabled = true;
> set_bit(NVME_TCP_Q_ALLOCATED, &queue->flags);
> - nvme_tcp_init_recv_ctx(queue);
> -
> - write_lock_bh(&queue->sock->sk->sk_callback_lock);
> - queue->sock->sk->sk_user_data = queue;
> - queue->state_change = queue->sock->sk->sk_state_change;
> - queue->data_ready = queue->sock->sk->sk_data_ready;
> - queue->write_space = queue->sock->sk->sk_write_space;
> - queue->sock->sk->sk_data_ready = nvme_tcp_data_ready;
> - queue->sock->sk->sk_state_change = nvme_tcp_state_change;
> - queue->sock->sk->sk_write_space = nvme_tcp_write_space;
> - #ifdef CONFIG_NET_RX_BUSY_POLL
> - queue->sock->sk->sk_ll_usec = 1;
> - #endif
> - write_unlock_bh(&queue->sock->sk->sk_callback_lock);
>
> return 0;
>
> @@ -1690,6 +1675,21 @@ static int nvme_tcp_start_queue(struct nvme_ctrl
> *nctrl, int idx)
> struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
> int ret;
>
> + queue->rd_enabled = true;
> + nvme_tcp_init_recv_ctx(queue);
> + write_lock_bh(&queue->sock->sk->sk_callback_lock);
> + queue->sock->sk->sk_user_data = queue;
> + queue->state_change = queue->sock->sk->sk_state_change;
> + queue->data_ready = queue->sock->sk->sk_data_ready;
> + queue->write_space = queue->sock->sk->sk_write_space;
> + queue->sock->sk->sk_data_ready = nvme_tcp_data_ready;
> + queue->sock->sk->sk_state_change = nvme_tcp_state_change;
> + queue->sock->sk->sk_write_space = nvme_tcp_write_space;
> + #ifdef CONFIG_NET_RX_BUSY_POLL
> + queue->sock->sk->sk_ll_usec = 1;
> + #endif
> + write_unlock_bh(&queue->sock->sk->sk_callback_lock);
> +
> if (idx)
> ret = nvmf_connect_io_queue(nctrl, idx);
> else
> --

Yes, your given fix seting the socket ops later in start_queue can correct
this issue properly. Then I have an another doubt that it maybe have hidden
trouble if the label out_free_queue in __nvme_tcp_alloc_io_queues does not
call kernel_sock_shutdown before freeing io_queue. So do you think the
following patch are needed?
Hope to get your reply again!

--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index b55b5a298..824c0ddb3 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1761,6 +1761,8 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
{
int i, ret;
+ struct nvme_tcp_ctrl *tcp_ctrl = to_tcp_ctrl(ctrl);
+ struct nvme_tcp_queue *queue;

for (i = 1; i < ctrl->queue_count; i++) {
ret = nvme_tcp_alloc_queue(ctrl, i);
@@ -1771,8 +1773,11 @@ static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
return 0;

out_free_queues:
- for (i--; i >= 1; i--)
+ for (i--; i >= 1; i--) {
+ queue = &tcp_ctrl->queues[i];
+ kernel_sock_shutdown(queue->sock, SHUT_RDWR);
nvme_tcp_free_queue(ctrl, i);
+ }

return ret;
}
--