Re: [Issue report] drivers/ftgmac100: DHCP occasionally fails during boot up or link down/up
From: Heyi Guo
Date: Sun Feb 20 2022 - 07:26:25 EST
Hi Andrew,
There is indeed a dead lock warning after enabling prove_locking:
[ 16.852199] ======================================================
[ 16.859102] WARNING: possible circular locking dependency detected
[ 16.866012] 5.10.36-60b3c9d-dirty-15f4fba #1 Not tainted
[ 16.871976] ------------------------------------------------------
[ 16.871991] kworker/1:1/23 is trying to acquire lock:
[ 16.872000] 80fa0920 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock+0x24/0x28
[ 16.872047]
[ 16.872047] but task is already holding lock:
[ 16.872051] 821d44c0 (&dev->lock){+.+.}-{3:3}, at:
phy_state_machine+0x50/0x290
[ 16.872076]
[ 16.872076] which lock already depends on the new lock.
[ 16.872076]
[ 16.872080]
[ 16.872080] the existing dependency chain (in reverse order) is:
[ 16.872083]
[ 16.872083] -> #1 (&dev->lock){+.+.}-{3:3}:
[ 16.872106] lock_acquire+0x6c/0x74
[ 16.872117] __mutex_lock+0xb4/0xa48
[ 16.872132] mutex_lock_nested+0x2c/0x34
[ 16.872141] phy_start+0x30/0xc4
[ 16.872155] ftgmac100_open+0x1a0/0x254
[ 16.872168] __dev_open+0x110/0x1d0
[ 16.872180] __dev_change_flags+0x1d0/0x258
[ 16.872192] dev_change_flags+0x28/0x58
[ 16.872204] do_setlink+0x258/0xc60
[ 16.872212] rtnl_setlink+0x110/0x18c
[ 16.872219] rtnetlink_rcv_msg+0x1d0/0x53c
[ 16.872226] netlink_rcv_skb+0xd0/0x128
[ 16.872233] rtnetlink_rcv+0x20/0x24
[ 16.872244] netlink_unicast+0x1a8/0x26c
[ 16.872254] netlink_sendmsg+0x220/0x464
[ 16.872265] __sys_sendto+0xe4/0x134
[ 16.872276] sys_sendto+0x24/0x2c
[ 16.872288] ret_fast_syscall+0x0/0x28
[ 16.872297] 0x7ed9e928
[ 16.872301]
[ 16.872301] -> #0 (rtnl_mutex){+.+.}-{3:3}:
[ 16.872325] __lock_acquire+0x17e8/0x3268
[ 16.872331] lock_acquire.part.0+0xcc/0x394
[ 16.872341] lock_acquire+0x6c/0x74
[ 16.872354] __mutex_lock+0xb4/0xa48
[ 16.872365] mutex_lock_nested+0x2c/0x34
[ 16.872377] rtnl_lock+0x24/0x28
[ 16.872389] ftgmac100_adjust_link+0xc0/0x144
[ 16.872401] phy_link_change+0x38/0x64
[ 16.872411] phy_check_link_status+0xa8/0xfc
[ 16.872422] phy_state_machine+0x80/0x290
[ 16.872435] process_one_work+0x294/0x7d8
[ 16.872447] worker_thread+0x6c/0x548
[ 16.872456] kthread+0x170/0x178
[ 16.872462] ret_from_fork+0x14/0x20
[ 16.872467] 0x0
[ 16.872471]
[ 16.872471] other info that might help us debug this:
[ 16.872471]
[ 16.872475] Possible unsafe locking scenario:
[ 16.872475]
[ 16.872478] CPU0 CPU1
[ 16.872482] ---- ----
[ 16.872485] lock(&dev->lock);
[ 16.872495] lock(rtnl_mutex);
[ 16.872505] lock(&dev->lock);
[ 16.872513] lock(rtnl_mutex);
[ 16.872522]
[ 16.872522] *** DEADLOCK ***
[ 16.872522]
[ 16.872528] 3 locks held by kworker/1:1/23:
[ 16.872532] #0: 818472a8
((wq_completion)events_power_efficient){+.+.}-{0:0}, at:
process_one_work+0x1e8/0x7d8
[ 16.872558] #1: 819fbef8
((work_completion)(&(&dev->state_queue)->work)){+.+.}-{0:0}, at:
process_one_work+0x1e8/0x7d8
[ 16.872582] #2: 821d44c0 (&dev->lock){+.+.}-{3:3}, at:
phy_state_machine+0x50/0x290
Any advice to get around of it?
Thanks,
Heyi
在 2022/2/20 上午2:28, Andrew Lunn 写道:
On Sat, Feb 19, 2022 at 06:08:35PM +0800, Heyi Guo wrote:
Hi Andrew,
The DHCP issue is gone after applying below patch. I put the lock statements
outside of the pure reset function, for the phydev lock has been acquired
before calling adjust_link. The lock order in ftgmac100_reset_task() was
also changed, to make it the same as the lock procedure in adjust_link, in
which the phydev is locked first and then rtnl_lock. I'm not quite sure
whether it will bring in any potential dead lock. Any advice?
Did you run the code with CONFIG_PROVE_LOCKING enabled. That will help
detect possible deadlock situations.
Andrew