Re: w1: 3.14-rc7 - possible recursive locking detected

From: David Fries
Date: Sun Apr 13 2014 - 18:24:59 EST


Belisko Marek,
Here is a possible solution, could you give it a try and report back?

Greg Kroah-Hartman,
Evgeniy asked me to look into this report. I don't have the
reporter's hardware configuration, but I wouldn't think that would be
needed, just some w1 bus master (even W1_MASTER_GPIO might work), then
loading the slave device and manually adding a slave device with that
family id. Even then I didn't reproduce the reported recursive
locking error. I saw unrelated locking reports, but not this one. I
wrote up the included patch, but that undoes the notify changes that
you added earlier in commit 47eba33a0997fc7, and I wanted to ask about
that commit. Specifically these two lines,

err = device_register(&sl->dev);
...
+ dev_set_uevent_suppress(&sl->dev, false);
+ kobject_uevent(&sl->dev.kobj, KOBJ_ADD);

Wouldn't the default be to not suppress? Nothing in the W1 system
enables suppressing so is that even needed? (I'm fine with saying
it's a good idea).
device_register at some point must call device_add and device_add
calls kobject_uevent(&dev->kobj, KOBJ_ADD);
so doesn't the KOBJ_ADD send the add a second time? As in it
shouldn't be needed?
Can the suppress be called before device_register to avoid the
automatic notify, then after it returns setup the slave device as this
patch does to avoid this problem report, and then call the KOBJ_ADD to
make everything happy?

I'm based on commit ce7613db2d8d4d5af2587a.

On Tue, Apr 08, 2014 at 09:47:50PM +0200, Belisko Marek wrote:
> Hi,
>
> On Mon, Mar 24, 2014 at 1:01 PM, Evgeniy Polyakov <zbr@xxxxxxxxxxx> wrote:
> > Hi everyone
> >
> > 24.03.2014, 01:50, "David Fries" <David@xxxxxxxxx>:
> >> On Mon, Mar 17, 2014 at 10:38:13PM +0100, Belisko Marek wrote:
> >>> booting latest 3.14-rc7 on gta04 board gives following warning:
> >>>
> >>> [ 3.101409] Driver for 1-wire Dallas network protocol.
> >>> [ 3.109649] omap_hdq omap_hdq: OMAP HDQ Hardware Rev 0.5. Driver in
> >>> [ 3.400146] CPU0
> >>> [ 3.402709] ----
> >>> [ 3.405273] lock(&(&priv->bus_notifier)->rwsem);
> >>> [ 3.410308] lock(&(&priv->bus_notifier)->rwsem);
> >>> [ 3.415374]
> >>> [ 3.415374] *** DEADLOCK ***
> >>>
> >>> AFAIK we didn't see that on (at least 3/14-rc2).
> >>
> >> The last drivers/w1 change went in before v3.14-rc1, so if it is
> >> something in the w1 code, it has either been there or something else
> >> in the rest of the kernel changed to expose it. I'm using the ds2490
> >> USB w1 dongle, I didn't see that problem when enabling lock debugging
> >> on 3.14.0-rc5+ which includes some changes I'm working on.
> >>
> >> https://github.com/dfries/linux.git w1_rework
> >
> > Can it be a bug in omap w1 driver?
> We test with actual Linus master and it's even worse (it hangs completely).
> Trace look little bit different and it only happens when booting via
> device tree (not with board file).
>
> [ 2.807434] Driver for 1-wire Dallas network protocol.
> [ 2.814117] omap_hdq omap_hdq: OMAP HDQ Hardware Rev 0.5. Driver in
> Interrupt mode
> [ 3.034271] call_modprobe: w1-family-0x1
> [ 3.051818]
> [ 3.053405] =============================================
> [ 3.059051] [ INFO: possible recursive locking detected ]
> [ 3.064727] 3.14.0-gta04 #556 Not tainted
> [ 3.068939] ---------------------------------------------
> [ 3.074615] w1_bus_master1/795 is trying to acquire lock:
> [ 3.080261] (&(&priv->bus_notifier)->rwsem){.+.+.+}, at:
> [<c0058ab0>] __blocking_notifier_call_chain+0x28/0x58
> [ 3.090911]
> [ 3.090911] but task is already holding lock:
> [ 3.097045] (&(&priv->bus_notifier)->rwsem){.+.+.+}, at:
> [<c0058ab0>] __blocking_notifier_call_chain+0x28/0x58
> [ 3.107666]
> [ 3.107666] other info that might help us debug this:
> [ 3.114501] Possible unsafe locking scenario:
> [ 3.114501]
> [ 3.120727] CPU0
> [ 3.123291] ----
> [ 3.125854] lock(&(&priv->bus_notifier)->rwsem);
> [ 3.130889] lock(&(&priv->bus_notifier)->rwsem);
> [ 3.135925]
> [ 3.135925] *** DEADLOCK ***
> [ 3.135925]
> [ 3.142150] May be due to missing lock nesting notation
> [ 3.142150]
> [ 3.149291] 2 locks held by w1_bus_master1/795:
> [ 3.154052] #0: (&dev->mutex#3){+.+.+.}, at: [<c03ba1c8>]
> w1_attach_slave_device+0xc4/0x1c8
> [ 3.163055] #1: (&(&priv->bus_notifier)->rwsem){.+.+.+}, at:
> [<c0058ab0>] __blocking_notifier_call_chain+0x28/0x58
> [ 3.174163]
> [ 3.174163] stack backtrace:
> [ 3.178741] CPU: 0 PID: 795 Comm: w1_bus_master1 Not tainted
> 3.14.0-gta04 #556
> [ 3.186370] [<c0013ca0>] (unwind_backtrace) from [<c0010f68>]
> (show_stack+0x10/0x14)
> [ 3.194488] [<c0010f68>] (show_stack) from [<c04f0900>]
> (dump_stack+0x6c/0x88)
> [ 3.202117] [<c04f0900>] (dump_stack) from [<c00746bc>]
> (print_deadlock_bug+0xc0/0xf0)
> [ 3.210449] [<c00746bc>] (print_deadlock_bug) from [<c0075f80>]
> (validate_chain.isra.29+0x4dc/0x534)
> [ 3.220031] [<c0075f80>] (validate_chain.isra.29) from [<c0076dac>]
> (__lock_acquire+0x728/0x834)
> [ 3.229248] [<c0076dac>] (__lock_acquire) from [<c00774b8>]
> (lock_acquire+0xf4/0x118)
> [ 3.237487] [<c00774b8>] (lock_acquire) from [<c04f8e04>]
> (down_read+0x24/0x34)
> [ 3.245178] [<c04f8e04>] (down_read) from [<c0058ab0>]
> (__blocking_notifier_call_chain+0x28/0x58)
> [ 3.254516] [<c0058ab0>] (__blocking_notifier_call_chain) from
> [<c0058af4>] (blocking_notifier_call_chain+0x14/0x18)
> [ 3.265563] [<c0058af4>] (blocking_notifier_call_chain) from
> [<c0301db8>] (device_add+0x1f4/0x368)
> [ 3.274993] [<c0301db8>] (device_add) from [<c03055cc>]
> (platform_device_add+0x138/0x1c8)
> [ 3.283569] [<c03055cc>] (platform_device_add) from [<c03bd124>]
> (w1_bq27000_add_slave+0x44/0x74)
> [ 3.292907] [<c03bd124>] (w1_bq27000_add_slave) from [<c03b9ed8>]
> (w1_bus_notify.part.0+0x48/0xc8)
> [ 3.302337] [<c03b9ed8>] (w1_bus_notify.part.0) from [<c04fbe1c>]
> (notifier_call_chain+0x38/0x68)
> [ 3.311645] [<c04fbe1c>] (notifier_call_chain) from [<c0058acc>]
> (__blocking_notifier_call_chain+0x44/0x58)
> [ 3.321899] [<c0058acc>] (__blocking_notifier_call_chain) from
> [<c0058af4>] (blocking_notifier_call_chain+0x14/0x18)
> [ 3.332946] [<c0058af4>] (blocking_notifier_call_chain) from
> [<c0301db8>] (device_add+0x1f4/0x368)
> [ 3.342346] [<c0301db8>] (device_add) from [<c03b9ad0>]
> (__w1_attach_slave_device+0x9c/0x134)
> [ 3.351318] [<c03b9ad0>] (__w1_attach_slave_device) from
> [<c03ba238>] (w1_attach_slave_device+0x134/0x1c8)
> [ 3.361480] [<c03ba238>] (w1_attach_slave_device) from [<c03ba348>]
> (w1_slave_found+0x7c/0x98)
> [ 3.370513] [<c03ba348>] (w1_slave_found) from [<c03bc818>]
> (omap_w1_search_bus+0x54/0x5c)
> [ 3.379211] [<c03bc818>] (omap_w1_search_bus) from [<c03bc758>]
> (w1_search_devices+0x3c/0x48)
> [ 3.388153] [<c03bc758>] (w1_search_devices) from [<c03baaec>]
> (w1_search_process_cb+0x64/0x108)
> [ 3.397399] [<c03baaec>] (w1_search_process_cb) from [<c03bac60>]
> (w1_process+0x68/0x14c)
> [ 3.405975] [<c03bac60>] (w1_process) from [<c0054bdc>] (kthread+0xdc/0xf0)
> [ 3.413299] [<c0054bdc>] (kthread) from [<c000dca8>]
> (ret_from_fork+0x14/0x2c)
> [ 24.451049] INFO: rcu_sched detected stalls on CPUs/tasks: {}
> (detected by 0, t=2102 jiffies, g=4294967086, c=4294967085, q=19)
> [ 24.463134] INFO: Stall ended before state dump start
> [ 87.501037] INFO: rcu_sched detected stalls on CPUs/tasks: {}
> (detected by 0, t=8407 jiffies, g=4294967086, c=4294967085, q=19)
> [ 87.513122] INFO: Stall ended before state dump start
> [ 150.551055] INFO: rcu_sched detected stalls on CPUs/tasks: {}
> (detected by 0, t=14712 jiffies, g=4294967086, c=4294967085, q=19)
> [ 150.563232] INFO: Stall ended before state dump start
> [ 213.601043] INFO: rcu_sched detected stalls on CPUs/tasks: {}
> (detected by 0, t=21017 jiffies, g=4294967086, c=4294967085, q=19)
> [ 213.613220] INFO: Stall ended before state dump start
> [ 276.651062] INFO: rcu_sched detected stalls on CPUs/tasks: {}
> (detected by 0, t=27322 jiffies, g=4294967086, c=4294967085, q=19)
> [ 276.663238] INFO: Stall ended before state dump start
>
> BR,
>
> marek
>
> --
> as simple and primitive as possible
> -------------------------------------------------
> Marek Belisko - OPEN-NANDRA
> Freelance Developer
>
> Ruska Nova Ves 219 | Presov, 08005 Slovak Republic
> Tel: +421 915 052 184
> skype: marekwhite
> twitter: #opennandra
> web: http://open-nandra.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
David Fries <david@xxxxxxxxx> PGP pub CB1EE8F0
http://fries.net/~david/


commit b3afc24a3ca6c261a8fe2c08b5bbccb51233c09a
Author: David Fries <David@xxxxxxxxx>
Date: Sun Apr 13 15:36:45 2014 -0500

w1: avoid recursive device_add

__w1_attach_slave_device calls device_add which calls w1_bus_notify
which calls for the w1_bq27000 slave driver, which calls
platform_device_add and device_add and deadlocks on getting
&(&priv->bus_notifier)->rwsem as it is still held in the previous
device_add. This avoids the problem by processing the family
add/remove outside of the slave device_add call. This does mean that
the slave device isn't setup when the notification that a slave is
available goes out.

Reported-by: Belisko Marek <marek.belisko@xxxxxxxxx>

diff --git a/drivers/w1/w1.c b/drivers/w1/w1.c
index b96f61b..dde09c1 100644
--- a/drivers/w1/w1.c
+++ b/drivers/w1/w1.c
@@ -614,27 +614,11 @@ end:
return err;
}

-/*
- * Handle sysfs file creation and removal here, before userspace is told that
- * the device is added / removed from the system
- */
-static int w1_bus_notify(struct notifier_block *nb, unsigned long action,
- void *data)
+static int w1_family_notify(unsigned long action, struct w1_slave *sl)
{
- struct device *dev = data;
- struct w1_slave *sl;
struct w1_family_ops *fops;
int err;

- /*
- * Only care about slave devices at the moment. Yes, we should use a
- * separate "type" for this, but for now, look at the release function
- * to know which type it is...
- */
- if (dev->release != w1_slave_release)
- return 0;
-
- sl = dev_to_w1_slave(dev);
fops = sl->family->fops;

if (!fops)
@@ -673,10 +657,6 @@ static int w1_bus_notify(struct notifier_block *nb, unsigned long action,
return 0;
}

-static struct notifier_block w1_bus_nb = {
- .notifier_call = w1_bus_notify,
-};
-
static int __w1_attach_slave_device(struct w1_slave *sl)
{
int err;
@@ -705,6 +685,7 @@ static int __w1_attach_slave_device(struct w1_slave *sl)
dev_name(&sl->dev), err);
return err;
}
+ w1_family_notify(BUS_NOTIFY_ADD_DEVICE, sl);


dev_set_uevent_suppress(&sl->dev, false);
@@ -799,6 +780,7 @@ int w1_unref_slave(struct w1_slave *sl)
msg.type = W1_SLAVE_REMOVE;
w1_netlink_send(sl->master, &msg);

+ w1_family_notify(BUS_NOTIFY_DEL_DEVICE, sl);
device_unregister(&sl->dev);
#ifdef DEBUG
memset(sl, 0, sizeof(*sl));
@@ -1186,10 +1168,6 @@ static int __init w1_init(void)
goto err_out_exit_init;
}

- retval = bus_register_notifier(&w1_bus_type, &w1_bus_nb);
- if (retval)
- goto err_out_bus_unregister;
-
retval = driver_register(&w1_master_driver);
if (retval) {
printk(KERN_ERR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/