Re: Deadlock in cleanup_net and addrconf_verify_work locks up workqueue
From: Dongyang
Date: Wed Jan 04 2023 - 08:07:23 EST
Hello guys,
At the beginning of the New Year, I also encountered this issue.
Hi Sargun,
Did you finally resolve this issue? As it was passed ~3 years, hope you still remember something about this case, thanks.
Hi Eric,
Below is my log, please let me give some feedback about your previous comments. Thanks.
Jan 1 00:06:30 kernel: [109121.968881] 000: perf: interrupt took too long (3914 > 3912), lowering kernel.perf_event_max_sample_rate to 51000
Jan 2 00:00:06 kernel: [195138.235171] 026: audit: type=1400 audit(1672588806.418:41): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=6221 comm="cups-browsed" capability=23 capname="sys_nice"
Jan 2 00:04:08 kernel: [195380.604772] 027: INFO: task kworker/u56:2:6079 blocked for more than 122 seconds.
Jan 2 00:04:08 kernel: [195380.604776] 027: Tainted: G OE 5.4.161-rt67-rc1 #1
Jan 2 00:04:08 kernel: [195380.604777] 027: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 2 00:04:08 kernel: [195380.604779] 027: kworker/u56:2 D 0 6079 2 0x80084000
Jan 2 00:04:08 kernel: [195380.604789] 027: Workqueue: netns cleanup_net
Jan 2 00:04:08 kernel: [195380.604790] 027: Call Trace:
Jan 2 00:04:08 kernel: [195380.604793] 027: __schedule+0x3d4/0x8a0
Jan 2 00:04:08 kernel: [195380.604799] 027: ? __switch_to_asm+0x34/0x70
Jan 2 00:04:08 kernel: [195380.604801] 027: schedule+0x49/0x100
Jan 2 00:04:08 kernel: [195380.604804] 027: schedule_timeout+0x1ed/0x3b0
Jan 2 00:04:08 kernel: [195380.604807] 027: wait_for_completion+0x86/0xe0
Jan 2 00:04:08 kernel: [195380.604810] 027: __flush_work+0x121/0x1d0
Jan 2 00:04:08 kernel: [195380.604814] 027: ? flush_workqueue_prep_pwqs+0x140/0x140
Jan 2 00:04:08 kernel: [195380.604817] 027: flush_work+0x10/0x20
Jan 2 00:04:08 kernel: [195380.604819] 027: rollback_registered_many+0x1b2/0x530
Jan 2 00:04:08 kernel: [195380.604824] 027: unregister_netdevice_many.part.0+0x12/0x90
Jan 2 00:04:08 kernel: [195380.604826] 027: default_device_exit_batch+0x15c/0x190
Jan 2 00:04:08 kernel: [195380.604828] 027: ? do_wait_intr_irq+0x90/0x90
Jan 2 00:04:08 kernel: [195380.604832] 027: ops_exit_list.isra.0+0x61/0x70
Jan 2 00:04:08 kernel: [195380.604835] 027: cleanup_net+0x269/0x3a0
Jan 2 00:04:08 kernel: [195380.604837] 027: process_one_work+0x1c8/0x470
Jan 2 00:04:08 kernel: [195380.604840] 027: worker_thread+0x4a/0x3d0
Jan 2 00:04:08 kernel: [195380.604843] 027: kthread+0x133/0x180
Jan 2 00:04:08 kernel: [195380.604846] 027: ? process_one_work+0x470/0x470
Jan 2 00:04:08 kernel: [195380.604848] 027: ? kthread_park+0x90/0x90
Jan 2 00:04:08 kernel: [195380.604850] 027: ret_from_fork+0x35/0x40
Jan 2 00:06:11 kernel: [195503.484781] 000: INFO: task kworker/u56:2:6079 blocked for more than 245 seconds.
Jan 2 00:06:11 kernel: [195503.484784] 000: Tainted: G OE 5.4.161-rt67-rc1 #1
Jan 2 00:06:11 kernel: [195503.484786] 000: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 2 00:06:11 kernel: [195503.484788] 000: kworker/u56:2 D 0 6079 2 0x80084000
Jan 2 00:06:11 kernel: [195503.484797] 000: Workqueue: netns cleanup_net
Jan 2 00:06:11 kernel: [195503.484798] 000: Call Trace:
Jan 2 00:06:11 kernel: [195503.484802] 000: __schedule+0x3d4/0x8a0
Jan 2 00:06:11 kernel: [195503.484806] 000: ? __switch_to_asm+0x34/0x70
Jan 2 00:06:11 kernel: [195503.484809] 000: schedule+0x49/0x100
Jan 2 00:06:11 kernel: [195503.484811] 000: schedule_timeout+0x1ed/0x3b0
Jan 2 00:06:11 kernel: [195503.484815] 000: wait_for_completion+0x86/0xe0
Jan 2 00:06:11 kernel: [195503.484818] 000: __flush_work+0x121/0x1d0
Jan 2 00:06:11 kernel: [195503.484822] 000: ? flush_workqueue_prep_pwqs+0x140/0x140
Jan 2 00:06:11 kernel: [195503.484825] 000: flush_work+0x10/0x20
Jan 2 00:06:11 kernel: [195503.484827] 000: rollback_registered_many+0x1b2/0x530
Jan 2 00:06:11 kernel: [195503.484832] 000: unregister_netdevice_many.part.0+0x12/0x90
Jan 2 00:06:11 kernel: [195503.484834] 000: default_device_exit_batch+0x15c/0x190
Jan 2 00:06:11 kernel: [195503.484837] 000: ? do_wait_intr_irq+0x90/0x90
Jan 2 00:06:11 kernel: [195503.484840] 000: ops_exit_list.isra.0+0x61/0x70
Jan 2 00:06:11 kernel: [195503.484843] 000: cleanup_net+0x269/0x3a0
Jan 2 00:06:11 kernel: [195503.484846] 000: process_one_work+0x1c8/0x470
Jan 2 00:06:11 kernel: [195503.484849] 000: worker_thread+0x4a/0x3d0
Jan 2 00:06:11 kernel: [195503.484852] 000: kthread+0x133/0x180
Jan 2 00:06:11 kernel: [195503.484854] 000: ? process_one_work+0x470/0x470
Jan 2 00:06:11 kernel: [195503.484856] 000: ? kthread_park+0x90/0x90
Jan 2 00:06:11 kernel: [195503.484858] 000: ret_from_fork+0x35/0x40
Jan 2 00:06:11 kernel: [195503.484863] 000: INFO: task kworker/26:0:6200 blocked for more than 122 seconds.
Jan 2 00:06:11 kernel: [195503.484864] 000: Tainted: G OE 5.4.161-rt67-rc1 #1
Jan 2 00:06:11 kernel: [195503.484865] 000: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 2 00:06:11 kernel: [195503.484866] 000: kworker/26:0 D 0 6200 2 0x80084000
Jan 2 00:06:11 kernel: [195503.484873] 000: Workqueue: ipv6_addrconf addrconf_verify_work
Jan 2 00:06:11 kernel: [195503.484874] 000: Call Trace:
Jan 2 00:06:11 kernel: [195503.484875] 000: __schedule+0x3d4/0x8a0
Jan 2 00:06:11 kernel: [195503.484877] 000: schedule+0x49/0x100
Jan 2 00:06:11 kernel: [195503.484879] 000: __rt_mutex_slowlock+0x8a/0x150
Jan 2 00:06:11 kernel: [195503.484882] 000: rt_mutex_slowlock_locked+0xbb/0x280
Jan 2 00:06:11 kernel: [195503.484884] 000: ? __switch_to_asm+0x40/0x70
Jan 2 00:06:11 kernel: [195503.484886] 000: rt_mutex_slowlock+0x76/0xc0
Jan 2 00:06:11 kernel: [195503.484889] 000: __rt_mutex_lock_state+0x75/0x90
Jan 2 00:06:11 kernel: [195503.484891] 000: _mutex_lock+0x13/0x20
Jan 2 00:06:11 kernel: [195503.484894] 000: rtnl_lock+0x15/0x20
Jan 2 00:06:11 kernel: [195503.484897] 000: addrconf_verify_work+0xe/0x20
Jan 2 00:06:11 kernel: [195503.484899] 000: process_one_work+0x1c8/0x470
Jan 2 00:06:11 kernel: [195503.484902] 000: worker_thread+0x4a/0x3d0
Jan 2 00:06:11 kernel: [195503.484905] 000: kthread+0x133/0x180
Jan 2 00:06:11 kernel: [195503.484906] 000: ? process_one_work+0x470/0x470
Jan 2 00:06:11 kernel: [195503.484908] 000: ? kthread_park+0x90/0x90
Jan 2 00:06:11 kernel: [195503.484910] 000: ret_from_fork+0x35/0x40
Jan 2 00:08:14 kernel: [195626.364781] 027: INFO: task kworker/u56:2:6079 blocked for more than 368 seconds.
Jan 2 00:08:14 kernel: [195626.364785] 027: Tainted: G OE 5.4.161-rt67-rc1 #1
Jan 2 00:08:14 kernel: [195626.364786] 027: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 2 00:08:14 kernel: [195626.364788] 027: kworker/u56:2 D 0 6079 2 0x80084000
Jan 2 00:08:14 kernel: [195626.364797] 027: Workqueue: netns cleanup_net
Jan 2 00:08:14 kernel: [195626.364798] 027: Call Trace:
Jan 2 00:08:14 kernel: [195626.364802] 027: __schedule+0x3d4/0x8a0
Jan 2 00:08:14 kernel: [195626.364807] 027: ? __switch_to_asm+0x34/0x70
Jan 2 00:08:14 kernel: [195626.364810] 027: schedule+0x49/0x100
Jan 2 00:08:14 kernel: [195626.364812] 027: schedule_timeout+0x1ed/0x3b0
Jan 2 00:08:14 kernel: [195626.364816] 027: wait_for_completion+0x86/0xe0
Jan 2 00:08:14 kernel: [195626.364818] 027: __flush_work+0x121/0x1d0
Jan 2 00:08:14 kernel: [195626.364822] 027: ? flush_workqueue_prep_pwqs+0x140/0x140
Jan 2 00:08:14 kernel: [195626.364825] 027: flush_work+0x10/0x20
Jan 2 00:08:14 kernel: [195626.364827] 027: rollback_registered_many+0x1b2/0x530
Jan 2 00:08:14 kernel: [195626.364832] 027: unregister_netdevice_many.part.0+0x12/0x90
Jan 2 00:08:14 kernel: [195626.364835] 027: default_device_exit_batch+0x15c/0x190
Jan 2 00:08:14 kernel: [195626.364837] 027: ? do_wait_intr_irq+0x90/0x90
Jan 2 00:08:14 kernel: [195626.364841] 027: ops_exit_list.isra.0+0x61/0x70
Jan 2 00:08:14 kernel: [195626.364843] 027: cleanup_net+0x269/0x3a0
Jan 2 00:08:14 kernel: [195626.364846] 027: process_one_work+0x1c8/0x470
Jan 2 00:08:14 kernel: [195626.364849] 027: worker_thread+0x4a/0x3d0
Jan 2 00:08:14 kernel: [195626.364852] 027: kthread+0x133/0x180
Jan 2 00:08:14 kernel: [195626.364855] 027: ? process_one_work+0x470/0x470
Jan 2 00:08:14 kernel: [195626.364857] 027: ? kthread_park+0x90/0x90
Jan 2 00:08:14 kernel: [195626.364859] 027: ret_from_fork+0x35/0x40
Jan 2 00:08:14 kernel: [195626.364863] 027: INFO: task kworker/26:0:6200 blocked for more than 245 seconds.
Jan 2 00:08:14 kernel: [195626.364865] 027: Tainted: G OE 5.4.161-rt67-rc1 #1
Jan 2 00:08:14 kernel: [195626.364866] 027: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 2 00:08:14 kernel: [195626.364867] 027: kworker/26:0 D 0 6200 2 0x80084000
Jan 2 00:08:14 kernel: [195626.364873] 027: Workqueue: ipv6_addrconf addrconf_verify_work
Jan 2 00:08:14 kernel: [195626.364874] 027: Call Trace:
Jan 2 00:08:14 kernel: [195626.364875] 027: __schedule+0x3d4/0x8a0
Jan 2 00:08:14 kernel: [195626.364878] 027: schedule+0x49/0x100
Jan 2 00:08:14 kernel: [195626.364880] 027: __rt_mutex_slowlock+0x8a/0x150
Jan 2 00:08:14 kernel: [195626.364882] 027: rt_mutex_slowlock_locked+0xbb/0x280
Jan 2 00:08:14 kernel: [195626.364885] 027: ? __switch_to_asm+0x40/0x70
Jan 2 00:08:14 kernel: [195626.364886] 027: rt_mutex_slowlock+0x76/0xc0
Jan 2 00:08:14 kernel: [195626.364889] 027: __rt_mutex_lock_state+0x75/0x90
Jan 2 00:08:14 kernel: [195626.364892] 027: _mutex_lock+0x13/0x20
Jan 2 00:08:14 kernel: [195626.364894] 027: rtnl_lock+0x15/0x20
Jan 2 00:08:14 kernel: [195626.364898] 027: addrconf_verify_work+0xe/0x20
Jan 2 00:08:14 kernel: [195626.364900] 027: process_one_work+0x1c8/0x470
Jan 2 00:08:14 kernel: [195626.364902] 027: worker_thread+0x4a/0x3d0
Jan 2 00:08:14 kernel: [195626.364905] 027: kthread+0x133/0x180
Jan 2 00:08:14 kernel: [195626.364907] 027: ? process_one_work+0x470/0x470
Jan 2 00:08:14 kernel: [195626.364909] 027: ? kthread_park+0x90/0x90
Jan 2 00:08:14 kernel: [195626.364911] 027: ret_from_fork+0x35/0x40
................
Jan 2 00:14:23 kernel: [195995.004768] 000: INFO: task kworker/u56:2:6079 blocked for more than 737 seconds.
Jan 2 00:14:23 kernel: [195995.004771] 000: Tainted: G OE 5.4.161-rt67-rc1 #1
Jan 2 00:14:23 kernel: [195995.004772] 000: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 2 00:14:23 kernel: [195995.004774] 000: kworker/u56:2 D 0 6079 2 0x80084000
Jan 2 00:14:23 kernel: [195995.004783] 000: Workqueue: netns cleanup_net
Jan 2 00:14:23 kernel: [195995.004784] 000: Call Trace:
Jan 2 00:14:23 kernel: [195995.004788] 000: __schedule+0x3d4/0x8a0
Jan 2 00:14:23 kernel: [195995.004793] 000: ? __switch_to_asm+0x34/0x70
Jan 2 00:14:23 kernel: [195995.004795] 000: schedule+0x49/0x100
Jan 2 00:14:23 kernel: [195995.004797] 000: schedule_timeout+0x1ed/0x3b0
Jan 2 00:14:23 kernel: [195995.004801] 000: wait_for_completion+0x86/0xe0
Jan 2 00:14:23 kernel: [195995.004804] 000: __flush_work+0x121/0x1d0
Jan 2 00:14:23 kernel: [195995.004807] 000: ? flush_workqueue_prep_pwqs+0x140/0x140
Jan 2 00:14:23 kernel: [195995.004810] 000: flush_work+0x10/0x20
Jan 2 00:14:23 kernel: [195995.004813] 000: rollback_registered_many+0x1b2/0x530
Jan 2 00:14:23 kernel: [195995.004817] 000: unregister_netdevice_many.part.0+0x12/0x90
Jan 2 00:14:23 kernel: [195995.004819] 000: default_device_exit_batch+0x15c/0x190
Jan 2 00:14:23 kernel: [195995.004822] 000: ? do_wait_intr_irq+0x90/0x90
Jan 2 00:14:23 kernel: [195995.004825] 000: ops_exit_list.isra.0+0x61/0x70
Jan 2 00:14:23 kernel: [195995.004828] 000: cleanup_net+0x269/0x3a0
Jan 2 00:14:23 kernel: [195995.004831] 000: process_one_work+0x1c8/0x470
Jan 2 00:14:23 kernel: [195995.004834] 000: worker_thread+0x4a/0x3d0
Jan 2 00:14:23 kernel: [195995.004837] 000: kthread+0x133/0x180
Jan 2 00:14:23 kernel: [195995.004839] 000: ? process_one_work+0x470/0x470
Jan 2 00:14:23 kernel: [195995.004841] 000: ? kthread_park+0x90/0x90
Jan 2 00:14:23 kernel: [195995.004843] 000: ret_from_fork+0x35/0x40
After "task kworker/u56:2:6079 blocked for more than 737 seconds.",
the network seems down, both ssh, ifconfig can't work.
>> Sure, PID 1369493 addrconf_verify_work() is waiting for RTNL.
>>
>> But PID 8 ?
>>
>> __flush_work() is being called.
>>
>> But from where ? Stacks seem not complete.
__flush_work is just from the work queue, so, Yes,
we don't know who put it into the queue. And don't know whether the clean netns work is reasonable.
Maybe I need to add a trace at the put_net.
put_net
__put_net queue_work(netns_wq, &net_cleanup_work);
DECLARE_WORK(net_cleanup_work, cleanup_net);
cleanup_net
ops_exit_list
default_device_exit_batch
unregister_netdevice_many
rollback_registered_many
flush_work
__flush_work
>> But PID 1369493 is waiting on a mutex, thus properly yielding the cpu.
>> (schedule() is clearly shown)
>>
>> This should not prevent other threads
>> from making progress so that flush_all_backlogs() completes eventually.
>>
>> flush_all_backlogs() does not care of how many threads are currently blocked
>> because they can not grab rtnl while flush_all_backlogs() is running.
In my log, we can see even the schedule is shown but the task is still in the D status.
It seems the "addrconf_verify_work" is blocked by "cleanup_net".
But why the "cleanup_net" is blocked?
Currently, my plan is trying to add trace/print at the "__flush_work" to see: what work is blocked,
then research this work is owned by who.
As an expert, if you can give some advice, it will be very grateful.
BTW, this is a RT kernel, I'm also checking the system workload.
BR,
Dongyang