Re: [PATCH blktests v1 2/3] nvme/rc: Avoid triggering host nvme-cli autoconnect

From: Max Gurtovoy
Date: Wed Jul 12 2023 - 20:12:14 EST

On 12/07/2023 15:04, Daniel Wagner wrote:
On Mon, Jul 10, 2023 at 07:30:20PM +0300, Max Gurtovoy wrote:


On 10/07/2023 18:03, Daniel Wagner wrote:
On Mon, Jul 10, 2023 at 03:31:23PM +0300, Max Gurtovoy wrote:
I think it is more than just the commit message.

Okay, I'm starting to understand what the problem is.

A lot of code that we could avoid was added for the --context cmdline
argument.

Correct, and it's not optional if we want the tests to pass for the fc transport.

Why does fc need --context for the tests to pass?

A typical nvme test consists of the following steps (nvme/004):

// nvme target setup (1)
_create_nvmet_subsystem "blktests-subsystem-1" "${loop_dev}" \
	"91fdba0d-f87b-4c25-b80f-db7be1418b9e"
_add_nvmet_subsys_to_port "${port}" "blktests-subsystem-1"

// nvme host setup (2)
_nvme_connect_subsys "${nvme_trtype}" blktests-subsystem-1

local nvmedev
nvmedev=$(_find_nvme_dev "blktests-subsystem-1")
cat "/sys/block/${nvmedev}n1/uuid"
cat "/sys/block/${nvmedev}n1/wwid"

// nvme host teardown (3)
_nvme_disconnect_subsys blktests-subsystem-1

// nvme target teardown (4)
_remove_nvmet_subsystem_from_port "${port}" "blktests-subsystem-1"
_remove_nvmet_subsystem "blktests-subsystem-1"
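
For reference, with the --context change applied the host setup in (2)
boils down to an nvme-cli call along the lines of the sketch below. The
flags and the def_* variables are quoted from memory (and for fc the
traddr/host-traddr are built from the wwnn/wwpn pair instead), so take
it as an illustration of the mechanism rather than the exact helper
code:

nvme connect --transport="${nvme_trtype}" \
	--traddr="${def_traddr}" --trsvcid="${def_trsvcid}" \
	--nqn="blktests-subsystem-1" \
	--hostnqn="${def_hostnqn}" --hostid="${def_hostid}" \
	--context="blktests"

If I understand the nvme-cli semantics correctly, the context tag is
what later lets the helpers tell the controllers created by blktests
apart from the ones created by the autoconnect machinery.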


The corresponding output with --context

run blktests nvme/004 at 2023-07-12 13:49:50
// (1)
loop0: detected capacity change from 0 to 32768
nvmet: adding nsid 1 to subsystem blktests-subsystem-1
nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
(NULL device *): {0:0} Association created
[174] nvmet: ctrl 1 start keep-alive timer for 5 secs
// (2)
nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[374] nvmet: adding queue 1 to ctrl 1.
[1138] nvmet: adding queue 2 to ctrl 1.
[73] nvmet: adding queue 3 to ctrl 1.
[174] nvmet: adding queue 4 to ctrl 1.
nvme nvme2: NVME-FC{0}: controller connect complete
nvme nvme2: NVME-FC{0}: new ctrl: NQN "blktests-subsystem-1"
// (3)
nvme nvme2: Removing ctrl: NQN "blktests-subsystem-1"
// (4)
[1138] nvmet: ctrl 1 stop keep-alive
(NULL device *): {0:0} Association deleted
(NULL device *): {0:0} Association freed
(NULL device *): Disconnect LS failed: No Association


and without --context

run blktests nvme/004 at 2023-07-12 13:50:33
// (1)
loop1: detected capacity change from 0 to 32768
nvmet: adding nsid 1 to subsystem blktests-subsystem-1
nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"

Why is this association to the discovery controller created? Because of
some system service?

Can we configure the blktests subsystem not to be discovered, or add
some access list to it? (A possible configfs sketch follows the log
below.)

(NULL device *): {0:0} Association created
[1235] nvmet: ctrl 1 start keep-alive timer for 120 secs
// XXX udev auto connect
nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:242d4a24-2484-4a80-8234-d0169409c5e8.
nvme nvme2: NVME-FC{0}: controller connect complete
nvme nvme2: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
nvme nvme3: NVME-FC{1}: create association : host wwpn 0x20001100aa000002 rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
(NULL device *): {0:1} Association created
[73] nvmet: ctrl 2 start keep-alive timer for 5 secs
// (2)
nvmet: creating nvm controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[374] nvmet: adding queue 1 to ctrl 2.
[233] nvmet: adding queue 2 to ctrl 2.
[73] nvmet: adding queue 3 to ctrl 2.
[1235] nvmet: adding queue 4 to ctrl 2.
nvme nvme3: NVME-FC{1}: controller connect complete
nvme nvme3: NVME-FC{1}: new ctrl: NQN "blktests-subsystem-1"
// (3)
nvme nvme3: Removing ctrl: NQN "blktests-subsystem-1"
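
To make the "access list" question above concrete: on the target side
this could look like the untested configfs sketch below. The paths
assume the standard nvmet configfs layout, and ${def_hostnqn} stands
for whatever hostnqn blktests connects with:

# Restrict blktests-subsystem-1 to a single, known host NQN.
echo 0 > /sys/kernel/config/nvmet/subsystems/blktests-subsystem-1/attr_allow_any_host
mkdir -p /sys/kernel/config/nvmet/hosts/"${def_hostnqn}"
ln -s /sys/kernel/config/nvmet/hosts/"${def_hostnqn}" \
	/sys/kernel/config/nvmet/subsystems/blktests-subsystem-1/allowed_hosts/

As far as I can tell, though, this only restricts who may connect to
blktests-subsystem-1; it would not stop the udev-triggered association
to the discovery subsystem seen above.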

The trace below looks like a bug we should fix :)

general protection fault, probably for non-canonical address 0xdffffc00000000a4: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000520-0x0000000000000527]
CPU: 1 PID: 2076 Comm: kworker/1:1 Tainted: G W 6.4.0-rc2+ #7 f2a41a58e59b44ee1bb7bc68087ccbe6d76392dd
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown unknown
Workqueue: nvmet-wq fcloop_fcp_recv_work [nvme_fcloop]
RIP: 0010:nvmet_execute_disc_get_log_page+0x23f/0x8c0 [nvmet]
Code: e8 c6 12 c7 e0 4c 89 6c 24 40 48 89 5c 24 08 4c 8b 3b 49 8d 9f 20 05 00 00 48 89 d8 48 c1 e8 03 48 b9 00 00 00 00 00 fc ff df <80> 3c 08 00 74 08 48 89 df e8 93 12 c7 e0 4c 89 74 24 30 4c 8b 2b
RSP: 0018:ffff888139a778a0 EFLAGS: 00010202
RAX: 00000000000000a4 RBX: 0000000000000520 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffffffa8af3a88
RBP: ffff888139a77ab0 R08: dffffc0000000000 R09: fffffbfff515e752
R10: 0000000000000000 R11: dffffc0000000001 R12: 1ffff1102734ef20
R13: ffff888105563260 R14: ffff888105563270 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88815a600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000824220 CR3: 0000000106512005 CR4: 0000000000370ee0
Call Trace:
<TASK>
? prepare_alloc_pages+0x1c5/0x580
? __cfi_nvmet_execute_disc_get_log_page+0x10/0x10 [nvmet 1da13efcd047161c3381cb240a95399f951fd98f]
? __alloc_pages+0x30e/0x650
? slab_post_alloc_hook+0x67/0x350
? __cfi___alloc_pages+0x10/0x10
? alloc_pages+0x30e/0x530
? sgl_alloc_order+0x118/0x320
nvmet_fc_queue_fcp_req+0xa19/0xda0 [nvmet_fc 11628cdb09a094fd591bfaf88be45b97e3b18e3a]
? nvmet_fc_rcv_fcp_req+0x9c0/0x9c0 [nvmet_fc 11628cdb09a094fd591bfaf88be45b97e3b18e3a]
? lockdep_hardirqs_on_prepare+0x2aa/0x5e0
? nvmet_fc_rcv_fcp_req+0x4de/0x9c0 [nvmet_fc 11628cdb09a094fd591bfaf88be45b97e3b18e3a]
nvmet_fc_rcv_fcp_req+0x4f0/0x9c0 [nvmet_fc 11628cdb09a094fd591bfaf88be45b97e3b18e3a]
fcloop_fcp_recv_work+0x173/0x440 [nvme_fcloop 05cf1144b564c4e1626f9f15422ccf61f2af41de]
process_one_work+0x80f/0xfb0
? rescuer_thread+0x1130/0x1130
? do_raw_spin_trylock+0xc9/0x1f0
? lock_acquired+0x310/0x9a0
? worker_thread+0xd5e/0x1260
worker_thread+0x91e/0x1260
? __cfi_lock_release+0x10/0x10
? do_raw_spin_unlock+0x116/0x8a0
kthread+0x25d/0x2f0
? __cfi_worker_thread+0x10/0x10
? __cfi_kthread+0x10/0x10
ret_from_fork+0x29/0x50
</TASK>

Maybe it's worth cleaning it up...

It really solves the problem that the autoconnect setup of nvme-cli is
disturbing the tests (*). The only other way I found to stop the
autoconnect is by disabling the udev rule completely. If autoconnect
isn't enabled, the context isn't necessary. Though changing the system
configuration from blktests seems a bit excessive.
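
For completeness, "disabling the udev rule" would mean something along
the lines of the sketch below; the rule name is taken from nvme-cli's
packaging and may differ per distro:

# Mask nvme-cli's autoconnect rule so udev no longer kicks off the
# fabrics connect service on discovery events, then reload the rules.
ln -s /dev/null /etc/udev/rules.d/70-nvmf-autoconnect.rules
udevadm control --reload

That is exactly the kind of system-wide change that arguably has no
place in a test suite.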

We should not stop any autoconnect during blktests. The autoconnect and
all the system admin services should run normally.

I do not agree here. The current blktests are not designed to run as
integration tests. Sure, we should also test this, but currently
blktests is just not there, and tcp/rdma are not actually covered
anyway.

What do you mean by tcp/rdma not being covered?

And maybe we should make several changes in blktests to make it
standalone, without interfering with the existing configuration made by
the system administrator.


Thanks,
Daniel