Re: Deadlock in cleanup_net and addrconf_verify_work locks up workqueue

From: Eric Dumazet
Date: Wed Feb 12 2020 - 00:36:34 EST




On 2/11/20 9:08 PM, Eric Dumazet wrote:
>
>
> On 2/11/20 11:23 AM, Sargun Dhillon wrote:
>> We've found a workqueue stall / deadlock. Our workload is a container-oriented
>> workload in which we utilize IPv6. Our container (namespace) churn is quite
>> frequent, and containers can be terminated before their networking is
>> even setup.
>>
>> We're running 4.19.73 in production, and in investigation of the underlying
>> causes, I don't think that future versions of 4.19 fix it.
>>
>> We've narrowed it down to a lockup between ipv6_addrconf, and cleanup_net.
>
> Sure, PID 1369493 addrconf_verify_work() is waiting for RTNL.
>
> But PID 8 ?
>
> __flush_work() is being called.
>
> But from where ? Stacks seem not complete.
>
>
>>
>> crash> bt 8
>> PID: 8 TASK: ffff9a1072b50000 CPU: 24 COMMAND: "kworker/u192:0"
>> #0 [ffffbfe2c00fbb70] __schedule at ffffffffa7f02bf7
>> #1 [ffffbfe2c00fbc10] schedule at ffffffffa7f031e8
>> #2 [ffffbfe2c00fbc18] schedule_timeout at ffffffffa7f0700e
>> #3 [ffffbfe2c00fbc90] wait_for_completion at ffffffffa7f03b50
>> #4 [ffffbfe2c00fbce0] __flush_work at ffffffffa76a2532
>> #5 [ffffbfe2c00fbd58] rollback_registered_many at ffffffffa7dbcdf4
>> #6 [ffffbfe2c00fbdc0] unregister_netdevice_many at ffffffffa7dbd31e
>> #7 [ffffbfe2c00fbdd0] default_device_exit_batch at ffffffffa7dbd512
>> #8 [ffffbfe2c00fbe40] cleanup_net at ffffffffa7dab970
>> #9 [ffffbfe2c00fbe98] process_one_work at ffffffffa76a17c4
>> #10 [ffffbfe2c00fbed8] worker_thread at ffffffffa76a19dd
>> #11 [ffffbfe2c00fbf10] kthread at ffffffffa76a7fd3
>> #12 [ffffbfe2c00fbf50] ret_from_fork at ffffffffa80001ff
>>
>> crash> bt 1369493
>> PID: 1369493 TASK: ffff9a03684d9600 CPU: 58 COMMAND: "kworker/58:1"
>> #0 [ffffbfe30d68fd48] __schedule at ffffffffa7f02bf7
>> #1 [ffffbfe30d68fde8] schedule at ffffffffa7f031e8
>> #2 [ffffbfe30d68fdf0] schedule_preempt_disabled at ffffffffa7f0349a
>> #3 [ffffbfe30d68fdf8] __mutex_lock at ffffffffa7f04aed
>> #4 [ffffbfe30d68fe90] addrconf_verify_work at ffffffffa7e8d1aa
>> #5 [ffffbfe30d68fe98] process_one_work at ffffffffa76a17c4
>> #6 [ffffbfe30d68fed8] worker_thread at ffffffffa76a19dd
>> #7 [ffffbfe30d68ff10] kthread at ffffffffa76a7fd3
>> #8 [ffffbfe30d68ff50] ret_from_fork at ffffffffa80001ff
>>
>>
>>
>> struct -x mutex.owner.counter rtnl_mutex
>> owner.counter = 0xffff9a1072b50001
>>
>> 0xffff9a1072b50001 & (~0x07) = 0xffff9a1072b50000
>>
>> This points back to PID 8 / CPU 24. It is working on cleanup_net, and a part
>> of cleanup net involves calling ops_exit_list, and as part of that it calls
>> default_device_exit_batch. default_device_exit_batch takes the rtnl lock before
>> calling into unregister_netdevice_many, and rollback_registered_many.
>> rollback_registered_many calls flush_all_backlogs. This will never complete
>> because it is holding the rtnl lock, and PID 1369493 / CPU 58 is waiting
>> for rtnl_lock.
>
> But PID 1369493 is waiting on a mutex, thus properly yielding the cpu.
> (schedule() is clearly shown)
>
> This should not prevent other threads
> from making progress so that flush_all_backlogs() completes eventually.
>
> flush_all_backlogs() does not care of how many threads are currently blocked
> because they can not grab rtnl while flush_all_backlogs() is running.
>
>>
>> If relevant, the workqueue stalls themselves look something like:
>> BUG: workqueue lockup - pool cpus=70 node=0 flags=0x0 nice=0 stuck for 3720s!
>> BUG: workqueue lockup - pool cpus=70 node=0 flags=0x0 nice=-20 stuck for 3719s!
>> Showing busy workqueues and worker pools:
>> workqueue events: flags=0x0
>> pwq 32: cpus=16 node=0 flags=0x0 nice=0 active=2/256
>> in-flight: 1274779:slab_caches_to_rcu_destroy_workfn slab_caches_to_rcu_destroy_workfn
>> workqueue events_highpri: flags=0x10
>> pwq 141: cpus=70 node=0 flags=0x0 nice=-20 active=1/256
>> pending: flush_backlog BAR(8)
>> workqueue events_power_efficient: flags=0x82
>> pwq 193: cpus=0-23,48-71 node=0 flags=0x4 nice=0 active=1/256
>> in-flight: 1396446:check_lifetime
>> workqueue mm_percpu_wq: flags=0x8
>> pwq 140: cpus=70 node=0 flags=0x0 nice=0 active=1/256
>> pending: vmstat_update
>> workqueue netns: flags=0xe000a
>> pwq 192: cpus=0-95 flags=0x4 nice=0 active=1/1
>> in-flight: 8:cleanup_net
>> delayed: cleanup_net
>> workqueue writeback: flags=0x4e
>> pwq 193: cpus=0-23,48-71 node=0 flags=0x4 nice=0 active=1/256
>> in-flight: 1334335:wb_workfn
>> workqueue kblockd: flags=0x18
>> pwq 141: cpus=70 node=0 flags=0x0 nice=-20 active=1/256
>> pending: blk_mq_run_work_fn
>> workqueue ipv6_addrconf: flags=0x40008
>> pwq 116: cpus=58 node=0 flags=0x0 nice=0 active=1/1
>> in-flight: 1369493:addrconf_verify_work
>> workqueue ena: flags=0xe000a
>> pwq 192: cpus=0-95 flags=0x4 nice=0 active=1/1
>> in-flight: 7505:ena_fw_reset_device [ena]
>>

Can you test the following :

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index cb493e15959c4d1bb68cf30f4099a8daa785bb84..bcc7ce03f13881415f64c7329559c7ed1e6321f3 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4410,8 +4410,6 @@ static void addrconf_verify_rtnl(void)
now = jiffies;
next = round_jiffies_up(now + ADDR_CHECK_FREQUENCY);

- cancel_delayed_work(&addr_chk_work);
-
for (i = 0; i < IN6_ADDR_HSIZE; i++) {
restart:
hlist_for_each_entry_rcu_bh(ifp, &inet6_addr_lst[i], addr_lst) {