Re: [RFC PATCH] nvme-tcp: fix possible crash when only partial queues are allocated by nvme_tcp_alloc_queue

From: Sagi Grimberg
Date: Wed Mar 15 2023 - 06:40:54 EST




On 3/15/23 11:41, zhangyanjun@xxxxxxxx wrote:
From: Yanjun Zhang <zhangyanjun@xxxxxxxx>

Thanks for your reply and suggestions!

Hi Yanjun,

In storage clusters built on the nvme-tcp driver, we have repeatedly
encountered the following two kinds of crashes on the host kernel.

one crash info:
[16802.107284] nvme nvme4: starting error recovery
[16802.109166] nvme nvme4: Reconnecting in 10 seconds...
[16812.173535] nvme nvme4: failed to connect socket: -111
[16812.173745] nvme nvme4: Failed reconnect attempt 1
[16812.173747] nvme nvme4: Reconnecting in 10 seconds...
[16822.413555] nvme nvme4: failed to connect socket: -111
[16822.413762] nvme nvme4: Failed reconnect attempt 2
[16822.413765] nvme nvme4: Reconnecting in 10 seconds...
[16832.661274] nvme nvme4: creating 32 I/O queues.
[16833.919887] BUG: kernel NULL pointer dereference, address: 0000000000000088
[16833.920068] nvme nvme4: Failed reconnect attempt 3
[16833.920094] #PF: supervisor write access in kernel mode
[16833.920261] nvme nvme4: Reconnecting in 10 seconds...
[16833.920368] #PF: error_code(0x0002) - not-present page
[16833.920600] PGD a21a12067 P4D b998cf067 PUD c00be5067 PMD 0
[16833.920731] Oops: 0002 [#1] SMP NOPTI
[16833.920852] CPU: 28 PID: 1324 Comm: kworker/28:1H Kdump: loaded Tainted: G S 5.15.67.cl9.x86_64 #1
[16833.920980] Hardware name: Inspur NF5280M6/NF5280M6, BIOS 06.00.04 07/07/2022
[16833.921086] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
[16833.921191] RIP: 0010:_raw_spin_lock_bh+0x17/0x30
[16833.921294] Code: ff 66 90 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 65 81 05 f0 ab 35 4a 00 02 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 05 c3 cc cc cc cc 89 c6 e8 c7 09 69 ff 66 90 c3 cc
[16833.921539] RSP: 0018:ff5d9770a078fdf8 EFLAGS: 00010246
[16833.921658] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[16833.921787] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000088
[16833.921912] RBP: 0000000000000088 R08: ff4f755fca433f00 R09: ff4f7558374a7874
[16833.922040] R10: 000000000000000c R11: 000000000000000c R12: ff4f7563dbbdad80
[16833.922177] R13: 0000000100fc48f3 R14: ff4f754b6dbfa278 R15: ff8f97707db0b905
[16833.922317] FS: 0000000000000000(0000) GS:ff4f756fccd00000(0000) knlGS:0000000000000000
[16833.922454] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16833.922593] CR2: 0000000000000088 CR3: 0000000b8477e002 CR4: 0000000000771ee0
[16833.922731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16833.922869] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[16833.923007] PKRU: 55555554
[16833.923138] Call Trace:
[16833.923271] <TASK>
[16833.923402] lock_sock_nested+0x1e/0x50
[16833.923545] nvme_tcp_try_recv+0x40/0xa0 [nvme_tcp]
[16833.923685] nvme_tcp_io_work+0x68/0xa0 [nvme_tcp]
[16833.923824] process_one_work+0x1e8/0x390
[16833.923969] worker_thread+0x53/0x3d0
[16833.924104] ? process_one_work+0x390/0x390
[16833.924240] kthread+0x124/0x150
[16833.924376] ? set_kthread_struct+0x50/0x50
[16833.924518] ret_from_fork+0x1f/0x30
[16833.924655] </TASK>

By analyzing the vmcore, we confirmed that the direct cause is sock->sk=NULL
and sock->ops=NULL in the struct nvme_tcp_queue. We also found that this
issue occurred during the execution of __nvme_tcp_alloc_io_queues every
time. For example, when nvme_tcp_alloc_queue failed to allocate the
29th I/O queue, the return flow would free the previously allocated 28 I/O
queues, but some io_works attached to these allocated I/O queues may
already have been queued.

That should not be possible, io_work is only triggered after a queue is
started. If io_work is running before the queue is started, we have
a wider issue. Did you confirm that the io_work belongs to an io queue
and not the admin queue?

Yes, we think the io_work belongs to an io queue: comparing the
nvme_tcp_queue pointer with the queues[] array attached to nvme_tcp_ctrl
shows that its qid is not zero. The size of struct nvme_tcp_queue in our
kernel is 0x130.
crash> nvme_tcp_queue.sock ff4f754b6dbfa270
sock = 0xff4f7563dbbdad80,
crash> socket.sk 0xff4f7563dbbdad80
sk = 0x0,
crash> socket.ops 0xff4f7563dbbdad80
ops = 0x0,
crash> nvme_tcp_queue.ctrl ff4f754b6dbfa270
ctrl = 0xff4f75608c9a6000,
crash> nvme_tcp_ctrl.queues 0xff4f75608c9a6000
queues = 0xff4f754b6dbf8000,

Obviously, ff4f754b6dbfa270, which was derived from the stack, is not
the admin queue pointer.

ff4f754b6dbfa270 is the sock pointer, not the queue pointer, but I
believe you that it is not the admin queue.

We also found that the sock attached to the 29th I/O queue is NULL,
while the socks attached to the other queues are all valid kernel
addresses. We think it was released at the err_sock label in
nvme_tcp_alloc_queue due to some error:
err_sock:
        sock_release(queue->sock);
        queue->sock = NULL;
crash> nvme_tcp_queue.sock ff4f754b6dbfa3a0
sock = 0x0,
crash> nvme_tcp_queue.sock ff4f754b6dbfa4d0
sock = 0xff4f75641f4aa700,


Makes sense. thanks for confirming.



So we suspect that directly freeing these allocated
I/O queues may result in this issue. To avoid it, we tried the following
code modification, but we have not yet been able to verify it because the
issue only occurs occasionally.

For now, we cannot confirm what queues the io_work while the I/O queues
are still being allocated, but we suspect that the socket callback
sk_data_ready=nvme_tcp_data_ready installed by nvme_tcp_alloc_queue may
do so in this interval (see the sketch below). Hope to get some
suggestions, thanks!
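
For reference, that callback looks roughly like this (a sketch based on
the upstream driver; the exact 5.15 code may differ slightly):

static void nvme_tcp_data_ready(struct sock *sk)
{
        struct nvme_tcp_queue *queue;

        read_lock_bh(&sk->sk_callback_lock);
        queue = sk->sk_user_data;
        /* any early data from the target can schedule io_work here */
        if (likely(queue && queue->rd_enabled) &&
            !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
        read_unlock_bh(&sk->sk_callback_lock);
}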

I would start by determining whether the running io_work belongs to an
io queue or to the admin queue. It is also possible that the admin queue
cleanup after the io queue allocation failed is the problematic teardown
that is causing this bug.


Signed-off-by: Yanjun Zhang <zhangyanjun@xxxxxxxx>
---
drivers/nvme/host/tcp.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 9b47dcb2a..5782183b8 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1761,6 +1761,7 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
 static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
 {
         int i, ret;
+        struct nvme_tcp_ctrl *tcp_ctrl = NULL;

         for (i = 1; i < ctrl->queue_count; i++) {
                 ret = nvme_tcp_alloc_queue(ctrl, i);
@@ -1771,9 +1772,11 @@ static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
         return 0;

 out_free_queues:
-        for (i--; i >= 1; i--)
+        for (i--; i >= 1; i--) {
+                tcp_ctrl = to_tcp_ctrl(ctrl);
+                __nvme_tcp_stop_queue(&tcp_ctrl->queues[i]);
                 nvme_tcp_free_queue(ctrl, i);
-
+        }
         return ret;
 }
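
For context, the __nvme_tcp_stop_queue helper used above looks roughly
like this in this era of the driver (a sketch, not copied verbatim):

static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
{
        /* quiesce the socket, restore the original sk callbacks, and
         * flush any io_work that may already have been queued
         */
        kernel_sock_shutdown(queue->sock, SHUT_RDWR);
        nvme_tcp_restore_sock_calls(queue);
        cancel_work_sync(&queue->io_work);
}

so calling it in the unwind loop both shuts the socket down and cancels
a possibly queued io_work before the queue is freed.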


The crash dump reminds me of a reported issue fixed by:
160f3549a907 ("nvme-tcp: fix UAF when detecting digest errors")

Is this applied in your host kernel?

I checked the above issue, and the kernel we use already has this fix applied.

Good to know.

And I have added the following trace to our reproduction attempts.


I'd also add this trace to your reproduction attempts:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 0ec7d3329595..0cbc8afb41c8 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -919,6 +919,10 @@ static void nvme_tcp_data_ready(struct sock *sk)

         read_lock_bh(&sk->sk_callback_lock);
         queue = sk->sk_user_data;
+        if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags)) {
+                pr_err("got data on a deallocated queue %d\n", nvme_tcp_queue_id(queue));
+                WARN_ON_ONCE(1);
+        }
         if (likely(queue && queue->rd_enabled) &&
             !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
                 queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
--


If indeed this is the io_queue and for some reason it is running,
perhaps the correct fix would be to set the socket ops later in
start_queue. It will also be symmetric because we clear them in
stop_queue:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 0ec7d3329595..b3b1e39966c0 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1617,22 +1617,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
         if (ret)
                 goto err_init_connect;

-        queue->rd_enabled = true;
         set_bit(NVME_TCP_Q_ALLOCATED, &queue->flags);
-        nvme_tcp_init_recv_ctx(queue);
-
-        write_lock_bh(&queue->sock->sk->sk_callback_lock);
-        queue->sock->sk->sk_user_data = queue;
-        queue->state_change = queue->sock->sk->sk_state_change;
-        queue->data_ready = queue->sock->sk->sk_data_ready;
-        queue->write_space = queue->sock->sk->sk_write_space;
-        queue->sock->sk->sk_data_ready = nvme_tcp_data_ready;
-        queue->sock->sk->sk_state_change = nvme_tcp_state_change;
-        queue->sock->sk->sk_write_space = nvme_tcp_write_space;
-#ifdef CONFIG_NET_RX_BUSY_POLL
-        queue->sock->sk->sk_ll_usec = 1;
-#endif
-        write_unlock_bh(&queue->sock->sk->sk_callback_lock);

         return 0;

@@ -1690,6 +1675,21 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
         struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
         int ret;

+        queue->rd_enabled = true;
+        nvme_tcp_init_recv_ctx(queue);
+        write_lock_bh(&queue->sock->sk->sk_callback_lock);
+        queue->sock->sk->sk_user_data = queue;
+        queue->state_change = queue->sock->sk->sk_state_change;
+        queue->data_ready = queue->sock->sk->sk_data_ready;
+        queue->write_space = queue->sock->sk->sk_write_space;
+        queue->sock->sk->sk_data_ready = nvme_tcp_data_ready;
+        queue->sock->sk->sk_state_change = nvme_tcp_state_change;
+        queue->sock->sk->sk_write_space = nvme_tcp_write_space;
+#ifdef CONFIG_NET_RX_BUSY_POLL
+        queue->sock->sk->sk_ll_usec = 1;
+#endif
+        write_unlock_bh(&queue->sock->sk->sk_callback_lock);
+
         if (idx)
                 ret = nvmf_connect_io_queue(nctrl, idx);
         else
--
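
(For reference, "clear them in stop_queue" above refers to
nvme_tcp_restore_sock_calls; roughly, as a sketch rather than verbatim
code, it does:)

static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)
{
        struct socket *sock = queue->sock;

        /* put back the original sk callbacks saved when they were overridden */
        write_lock_bh(&sock->sk->sk_callback_lock);
        sock->sk->sk_user_data = NULL;
        sock->sk->sk_data_ready = queue->data_ready;
        sock->sk->sk_state_change = queue->state_change;
        sock->sk->sk_write_space = queue->write_space;
        write_unlock_bh(&sock->sk->sk_callback_lock);
}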

Yes, your suggested fix of setting the socket ops later in start_queue can
correct this issue properly. But I have another doubt: there may still be
hidden trouble if the out_free_queues label in __nvme_tcp_alloc_io_queues
does not call kernel_sock_shutdown before freeing the I/O queues. So do
you think the following patch is needed?
Hope to get your reply again!

--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index b55b5a298..824c0ddb3 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1761,6 +1761,8 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
 static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
 {
         int i, ret;
+        struct nvme_tcp_ctrl *tcp_ctrl = to_tcp_ctrl(ctrl);
+        struct nvme_tcp_queue *queue;

         for (i = 1; i < ctrl->queue_count; i++) {
                 ret = nvme_tcp_alloc_queue(ctrl, i);
@@ -1771,8 +1773,11 @@ static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
         return 0;

 out_free_queues:
-        for (i--; i >= 1; i--)
+        for (i--; i >= 1; i--) {
+                queue = &tcp_ctrl->queues[i];
+                kernel_sock_shutdown(queue->sock, SHUT_RDWR);
                 nvme_tcp_free_queue(ctrl, i);
+        }

         return ret;
 }
--

I don't see why this is needed. The shutdown prevents us from reading
from or writing to the socket because it is going away. And at this
point, if we have not triggered io_work (and if we didn't assign
data_ready yet, we can't), there should be no way that some other
context is consuming the socket. This is the assumption of the current
code.