Re: [Issue report] drivers/ftgmac100: DHCP occasionally fails during boot up or link down/up

From: Heyi Guo
Date: Sun Feb 20 2022 - 07:26:25 EST

Next message: Borislav Petkov: "[GIT PULL] locking/urgent for 5.17"
Previous message: Alim Akhtar: "RE: [PATCH] clocksource/drivers/exynos_mct: Remove mct interrupt index enum"
In reply to: Andrew Lunn: "Re: [Issue report] drivers/ftgmac100: DHCP occasionally fails during boot up or link down/up"
Next in thread: Andrew Lunn: "Re: [Issue report] drivers/ftgmac100: DHCP occasionally fails during boot up or link down/up"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Andrew,

There is indeed a dead lock warning after enabling prove_locking:

[   16.852199] ======================================================
[   16.859102] WARNING: possible circular locking dependency detected
[   16.866012] 5.10.36-60b3c9d-dirty-15f4fba #1 Not tainted
[   16.871976] ------------------------------------------------------
[   16.871991] kworker/1:1/23 is trying to acquire lock:
[   16.872000] 80fa0920 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock+0x24/0x28
[   16.872047]
[   16.872047] but task is already holding lock:
[   16.872051] 821d44c0 (&dev->lock){+.+.}-{3:3}, at: phy_state_machine+0x50/0x290
[   16.872076]
[   16.872076] which lock already depends on the new lock.
[   16.872076]
[   16.872080]
[   16.872080] the existing dependency chain (in reverse order) is:
[   16.872083]
[   16.872083] -> #1 (&dev->lock){+.+.}-{3:3}:
[   16.872106]        lock_acquire+0x6c/0x74
[   16.872117]        __mutex_lock+0xb4/0xa48
[   16.872132]        mutex_lock_nested+0x2c/0x34
[   16.872141]        phy_start+0x30/0xc4
[   16.872155]        ftgmac100_open+0x1a0/0x254
[   16.872168]        __dev_open+0x110/0x1d0
[   16.872180]        __dev_change_flags+0x1d0/0x258
[   16.872192]        dev_change_flags+0x28/0x58
[   16.872204]        do_setlink+0x258/0xc60
[   16.872212]        rtnl_setlink+0x110/0x18c
[   16.872219]        rtnetlink_rcv_msg+0x1d0/0x53c
[   16.872226]        netlink_rcv_skb+0xd0/0x128
[   16.872233]        rtnetlink_rcv+0x20/0x24
[   16.872244]        netlink_unicast+0x1a8/0x26c
[   16.872254]        netlink_sendmsg+0x220/0x464
[   16.872265]        __sys_sendto+0xe4/0x134
[   16.872276]        sys_sendto+0x24/0x2c
[   16.872288]        ret_fast_syscall+0x0/0x28
[   16.872297]        0x7ed9e928
[   16.872301]
[   16.872301] -> #0 (rtnl_mutex){+.+.}-{3:3}:
[   16.872325]        __lock_acquire+0x17e8/0x3268
[   16.872331]        lock_acquire.part.0+0xcc/0x394
[   16.872341]        lock_acquire+0x6c/0x74
[   16.872354]        __mutex_lock+0xb4/0xa48
[   16.872365]        mutex_lock_nested+0x2c/0x34
[   16.872377]        rtnl_lock+0x24/0x28
[   16.872389]        ftgmac100_adjust_link+0xc0/0x144
[   16.872401]        phy_link_change+0x38/0x64
[   16.872411]        phy_check_link_status+0xa8/0xfc
[   16.872422]        phy_state_machine+0x80/0x290
[   16.872435]        process_one_work+0x294/0x7d8
[   16.872447]        worker_thread+0x6c/0x548
[   16.872456]        kthread+0x170/0x178
[   16.872462]        ret_from_fork+0x14/0x20
[   16.872467]        0x0
[   16.872471]
[   16.872471] other info that might help us debug this:
[   16.872471]
[   16.872475] Possible unsafe locking scenario:
[   16.872475]
[   16.872478]        CPU0                    CPU1
[   16.872482]        ----                    ----
[   16.872485]   lock(&dev->lock);
[   16.872495]                                lock(rtnl_mutex);
[   16.872505] lock(&dev->lock);
[   16.872513]   lock(rtnl_mutex);
[   16.872522]
[   16.872522] *** DEADLOCK ***
[   16.872522]
[   16.872528] 3 locks held by kworker/1:1/23:
[   16.872532] #0: 818472a8 ((wq_completion)events_power_efficient){+.+.}-{0:0}, at: process_one_work+0x1e8/0x7d8
[   16.872558] #1: 819fbef8 ((work_completion)(&(&dev->state_queue)->work)){+.+.}-{0:0}, at: process_one_work+0x1e8/0x7d8
[   16.872582] #2: 821d44c0 (&dev->lock){+.+.}-{3:3}, at: phy_state_machine+0x50/0x290

Any advice to get around of it?

Thanks,

Heyi

在 2022/2/20 上午2:28, Andrew Lunn 写道:

On Sat, Feb 19, 2022 at 06:08:35PM +0800, Heyi Guo wrote:

Hi Andrew,

The DHCP issue is gone after applying below patch. I put the lock statements
outside of the pure reset function, for the phydev lock has been acquired
before calling adjust_link. The lock order in ftgmac100_reset_task() was
also changed, to make it the same as the lock procedure in adjust_link, in
which the phydev is locked first and then rtnl_lock. I'm not quite sure
whether it will bring in any potential dead lock. Any advice?

Did you run the code with CONFIG_PROVE_LOCKING enabled. That will help
detect possible deadlock situations.

Andrew

Next message: Borislav Petkov: "[GIT PULL] locking/urgent for 5.17"
Previous message: Alim Akhtar: "RE: [PATCH] clocksource/drivers/exynos_mct: Remove mct interrupt index enum"
In reply to: Andrew Lunn: "Re: [Issue report] drivers/ftgmac100: DHCP occasionally fails during boot up or link down/up"
Next in thread: Andrew Lunn: "Re: [Issue report] drivers/ftgmac100: DHCP occasionally fails during boot up or link down/up"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]