Re: [bug report] RDMA/iwpm: reentrant iwpm hello message
From: Lin Ma
Date: Tue Dec 24 2024 - 11:17:04 EST
Hello Leon,
> > Please let me know if I understand this correctly or incorrectly?
>
> The thing is that down_write() is called when we are unregistering the
> module which sent netlink messages. It shouldn't happen.
>
I acknowledge that this is a low-probability event. However, the race
condition still exists; otherwise, these read and write semaphores
would not be necessary. Why not just remove all of them?
Moreover, I find that even without the deadlock, this reentrant message
hangs the sending task in the kernel, and the task cannot be killed. The
logs look like the following (after disabling the locking sanitizer,
tested on the latest Ubuntu):
[2187983.899998] INFO: task poc.elf:1717021 blocked for more than 122 seconds.
[2187983.900049] Not tainted 6.8.0-49-generic #49~22.04.1-Ubuntu
[2187983.900057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2187983.900063] task:poc.elf state:D stack:0 pid:1717021 tgid:1717021 ppid:1716834 flags:0x00004006
[2187983.900087] Call Trace:
[2187983.900094] <TASK>
[2187983.900355] __schedule+0x27c/0x6a0
[2187983.900430] schedule+0x33/0x110
[2187983.900442] schedule_preempt_disabled+0x15/0x30
[2187983.900454] __mutex_lock.constprop.0+0x3f8/0x7a0
[2187983.900476] __mutex_lock_slowpath+0x13/0x20
[2187983.900486] mutex_lock+0x3c/0x50
[2187983.900493] __netlink_dump_start+0x76/0x2a0
[2187983.900552] rdma_nl_rcv_msg+0x24c/0x310 [ib_core]
[2187983.900673] ? __pfx_iwpm_hello_cb+0x10/0x10 [iw_cm]
[2187983.900699] rdma_nl_rcv_skb.constprop.0.isra.0+0xbb/0x120 [ib_core]
[2187983.900802] rdma_nl_rcv+0xe/0x20 [ib_core]
[2187983.900898] netlink_unicast+0x1b0/0x2a0
[2187983.900911] rdma_nl_unicast+0x49/0x70 [ib_core]
[2187983.901005] iwpm_send_hello+0xfd/0x150 [iw_cm]
[2187983.901030] iwpm_hello_cb+0xb9/0x130 [iw_cm]
[2187983.901052] netlink_dump+0x1c0/0x340
[2187983.901065] __netlink_dump_start+0x1ef/0x2a0
[2187983.901077] rdma_nl_rcv_msg+0x24c/0x310 [ib_core]
[2187983.901219] ? __pfx_iwpm_hello_cb+0x10/0x10 [iw_cm]
[2187983.901245] rdma_nl_rcv_skb.constprop.0.isra.0+0xbb/0x120 [ib_core]
[2187983.901344] rdma_nl_rcv+0xe/0x20 [ib_core]
[2187983.901437] netlink_unicast+0x1b0/0x2a0
[2187983.901449] rdma_nl_unicast+0x49/0x70 [ib_core]
[2187983.901544] iwpm_send_hello+0xfd/0x150 [iw_cm]
[2187983.901567] iwpm_hello_cb+0xb9/0x130 [iw_cm]
[2187983.901589] netlink_dump+0x1c0/0x340
[2187983.901602] __netlink_dump_start+0x1ef/0x2a0
[2187983.901613] rdma_nl_rcv_msg+0x24c/0x310 [ib_core]
[2187983.901707] ? __pfx_iwpm_hello_cb+0x10/0x10 [iw_cm]
[2187983.901731] rdma_nl_rcv_skb.constprop.0.isra.0+0xbb/0x120 [ib_core]
[2187983.901830] rdma_nl_rcv+0xe/0x20 [ib_core]
[2187983.901922] netlink_unicast+0x1b0/0x2a0
[2187983.901933] netlink_sendmsg+0x214/0x470
[2187983.901946] __sys_sendto+0x21b/0x230
[2187983.901992] __x64_sys_sendto+0x24/0x40
[2187983.902002] x64_sys_call+0x1fc0/0x24b0
[2187983.902023] do_syscall_64+0x81/0x170
[2187983.902059] ? security_file_alloc+0x5f/0xf0
[2187983.902079] ? alloc_empty_file+0x85/0x130
[2187983.902140] ? alloc_file+0x9b/0x170
[2187983.902150] ? alloc_file_pseudo+0x9e/0x100
[2187983.902163] ? restore_fpregs_from_fpstate+0x3d/0xd0
[2187983.902197] ? switch_fpu_return+0x55/0xf0
[2187983.902208] ? syscall_exit_to_user_mode+0x83/0x260
[2187983.902229] ? do_syscall_64+0x8d/0x170
[2187983.902240] ? irqentry_exit+0x43/0x50
[2187983.902249] ? clear_bhb_loop+0x15/0x70
[2187983.902293] ? clear_bhb_loop+0x15/0x70
[2187983.902302] ? clear_bhb_loop+0x15/0x70
[2187983.902311] entry_SYSCALL_64_after_hwframe+0x78/0x80
[2187983.902319] RIP: 0033:0x440624
[2187983.902582] RSP: 002b:00007ffcfa4b29f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[2187983.902592] RAX: ffffffffffffffda RBX: 0000000000400400 RCX: 0000000000440624
[2187983.902598] RDX: 0000000000000018 RSI: 00007ffcfa4b2a30 RDI: 0000000000000003
[2187983.902604] RBP: 00007ffcfa4b3a40 R08: 000000000047df08 R09: 000000000000000c
[2187983.902609] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000403990
[2187983.902614] R13: 0000000000000000 R14: 00000000006a6018 R15: 0000000000000000
That's why I'm quite sure this is a bug that requires fixing.
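
For illustration only, one way to read the trace is sketched below as a
small user-space model. It is not kernel code. The names Sock, dump_start,
hello_cb and simulate are hypothetical, and the idea that the dump mutex
belongs to the sender's socket (the user socket for the first message, the
kernel socket for the nested hellos) is an assumption inferred from the
three nested __netlink_dump_start frames, not something the log states.
Under that assumption, the third entry tries to take a non-recursive mutex
that the same task already holds further up its own stack.

```python
import threading


class Sock:
    """Hypothetical stand-in for a netlink socket and its dump mutex."""

    def __init__(self, name):
        self.name = name
        # threading.Lock is non-recursive, like a kernel mutex.
        self.cb_mutex = threading.Lock()


# Assumption: nested hello messages are sent from the kernel-side socket.
KERNEL = Sock("kernel")


def dump_start(sender, depth, events):
    """Model of the receive path: lock the sender's mutex, run the dump."""
    # A real mutex_lock() would sleep here forever (task state D);
    # the model records the event and returns instead of hanging.
    if not sender.cb_mutex.acquire(blocking=False):
        events.append((depth, sender.name, "blocked"))
        return
    try:
        events.append((depth, sender.name, "locked"))
        hello_cb(depth, events)
    finally:
        sender.cb_mutex.release()


def hello_cb(depth, events):
    """Model of the hello callback: it sends another hello, which
    re-enters the receive path on the same call stack."""
    dump_start(KERNEL, depth + 1, events)


def simulate():
    """One user-space sendto() that triggers the nested hello messages."""
    events = []
    dump_start(Sock("user"), 1, events)
    return events
```

Calling simulate() returns one event per nesting level. In this model the
first two levels take their mutex and the third is the one that cannot,
which lines up with the single mutex_lock frame at the top of the trace.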
Thanks
Lin