Hi Guenter,
This patch results in the attached lockdep splat when running the
ast2600-evb emulation in qemu with aspeed_g5_defconfig and lock debugging
enabled. Reverting this patch fixes the problem.
Umm ... That's only true if you think the problem is the lockdep splat,
rather than the actual potential deadlock?!
[ 9.809960] ======================================================
[ 9.810053] WARNING: possible circular locking dependency detected
[ 9.810196] 6.6.0-rc1-00004-g6faca50f629f #1 Tainted: G N
I don't have this exact tree, but on 6.6-rc1,
[ 9.810327] ------------------------------------------------------
[ 9.810406] ip/357 is trying to acquire lock:
[ 9.810501] 83af6c40 ((work_completion)(&(&dev->state_queue)->work)){+.+.}-{0:0}, at: __flush_work+0x40/0x550
[ 9.811052]
[ 9.811052] but task is already holding lock:
[ 9.811133] 81639924 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x124/0x514
[ 9.811264]
[ 9.811264] which lock already depends on the new lock.
[ 9.811264]
[ 9.811361]
[ 9.811361] the existing dependency chain (in reverse order) is:
[ 9.811466]
[ 9.811466] -> #1 (rtnl_mutex){+.+.}-{3:3}:
[ 9.811616] lock_acquire+0xfc/0x368
[ 9.811717] __mutex_lock+0x90/0xf00
[ 9.811782] mutex_lock_nested+0x24/0x2c
[ 9.811845] ftgmac100_reset+0x1c/0x1dc
This does indeed take the RTNL:
static void ftgmac100_reset(struct ftgmac100 *priv)
{
struct net_device *netdev = priv->netdev;
int err;
netdev_dbg(netdev, "Resetting NIC...\n");
/* Lock the world */
rtnl_lock();
and is called from
[ 9.811907] ftgmac100_adjust_link+0xc0/0x13c
[ 9.811972] phy_link_change+0x30/0x5c
[ 9.812035] phy_check_link_status+0x9c/0x11c
[ 9.812100] phy_state_machine+0x1c0/0x2c0
this work (phy_state_machine is the function), which
[ 9.812405] -> #0 ((work_completion)(&(&dev->state_queue)->work)){+.+.}-{0:0}:
[ 9.812531] check_prev_add+0x128/0x15ec
[ 9.812594] __lock_acquire+0x16ec/0x20cc
[ 9.812656] lock_acquire+0xfc/0x368
[ 9.812712] __flush_work+0x70/0x550
[ 9.812769] __cancel_work_timer+0x1e4/0x264
[ 9.812833] phy_stop+0x78/0x128
is cancelled by phy_stop() in phy_stop_machine():
void phy_stop_machine(struct phy_device *phydev)
{
cancel_delayed_work_sync(&phydev->state_queue);
but of course that's called by the driver under RTNL:
[ 9.812889] ftgmac100_stop+0x5c/0xac
[ 9.812949] __dev_close_many+0xb8/0x140
(__dev_close_many requires RTNL)
So you have a potential deadlock in this driver. Yes, workqueue items
and RTNL are basically incompatible. Don't do that. Now this bug was
_probably_ added by commit 1baf2e50e48f ("drivers/net/ftgmac100: fix
DHCP potential failure with systemd") which added a call to
ftgmac100_reset() in ftgmac100_adjust_link() which is the thing called
from the PHY state machine in the first place.
Should that be reverted? I don't know ... maybe it can be fixed
differently.
But anyway ... as far as lockdep/workqueue stuff is concerned I'd
definitely call it a win rather than a bug! Yay for making lockdep
useful - it found a deadlock situation for you! :-) No need to blame
lockdep for that :P