A kernel stall issue caused by InfiniBand MAD packets

From: Yacan Liu
Date: Wed Feb 22 2023 - 08:00:59 EST


Hi folks:

We recently encountered a kernel stall issue related to InfiniBand MAD packets. We believe it is triggered by abnormal behavior of a user-level program, but I am not sure whether the kernel driver should add some protection against it.

In our cluster, the nodes establish multiple RDMA connections with each other. About one minute after we start the program, the kernel reports the following error:

[ 5489.530125] rcu: INFO: rcu_sched self-detected stall on CPU
[ 5489.535714] rcu: 55-....: (19027 ticks this GP) idle=236/1/0x4000000000000000 softirq=13510/13510 fqs=4359
[ 5489.545554] (t=21015 jiffies g=279845 q=701559)
[ 5489.545557] NMI backtrace for cpu 55
[ 5489.545559] CPU: 55 PID: 44979 Comm: kworker/u196:1 Tainted: G OE 5.10.91-netease3-1 #2
[ 5489.545560] Hardware name: Inspur SA5212M5/YZMB-00882-10T, BIOS 4.1.21 08/25/2021
[ 5489.545582] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
[ 5489.545584] Call Trace:
[ 5489.545586]
[ 5489.545590] dump_stack+0x6b/0x83
[ 5489.545593] nmi_cpu_backtrace.cold+0x32/0x69
[ 5489.545596] ? lapic_can_unplug_cpu+0x80/0x80
[ 5489.545599] nmi_trigger_cpumask_backtrace+0xd7/0xe0
[ 5489.545602] rcu_dump_cpu_stacks+0xa2/0xd0
[ 5489.545604] rcu_sched_clock_irq.cold+0x1ff/0x3d6
[ 5489.545607] ? trigger_load_balance+0x5a/0x240
[ 5489.545610] update_process_times+0x8c/0xc0
[ 5489.545613] tick_sched_handle+0x22/0x60
[ 5489.545615] tick_sched_timer+0x7c/0xb0
[ 5489.545617] ? tick_do_update_jiffies64.part.0+0xc0/0xc0
[ 5489.545619] __hrtimer_run_queues+0x12a/0x270
[ 5489.545621] hrtimer_interrupt+0x110/0x2c0
[ 5489.545623] __sysvec_apic_timer_interrupt+0x5f/0xd0
[ 5489.545627] asm_call_irq_on_stack+0xf/0x20
[ 5489.545627]
[ 5489.545629] sysvec_apic_timer_interrupt+0x72/0x80
[ 5489.545631] asm_sysvec_apic_timer_interrupt+0x12/0x20

OS: Debian 11
Kernel: 5.10.91
ofed: MLNX_OFED_LINUX-5.5-1.0.3.2

PS: The kernel thread kworker/u196:2+ib-comp-unb-w is consuming 100% CPU.

After some investigation, we found that the kernel thread spends most of its time in wait_for_response(). During RDMA connection establishment, when the local side sends out a MAD request it waits a limited time for the response from the remote side; during that window the request sits on the wait_list. The wait_list is kept sorted by timeout, so insertion and lookup are O(n). When the remote application does not call rdma_accept() for some reason, it never sends the response, so the corresponding entry is not removed from the local wait_list for a long time. Meanwhile the local application keeps initiating new connections, so the wait_list grows and each wait_for_response() call takes longer and longer.

Regards,
Yacan