Re: Lockdep complains about commit 1331e7a1bb ("rcu: Remove _rcu_barrier() dependency on __stop_machine()")

From: Jiri Kosina
Date: Tue Oct 02 2012 - 17:59:09 EST


On Tue, 2 Oct 2012, Jiri Kosina wrote:

> > > > 1331e7a1bbe1f11b19c4327ba0853bee2a606543 is the first bad commit
> > > > commit 1331e7a1bbe1f11b19c4327ba0853bee2a606543
> > > > Author: Paul E. McKenney <paul.mckenney@xxxxxxxxxx>
> > > > Date: Thu Aug 2 17:43:50 2012 -0700
> > > >
> > > > rcu: Remove _rcu_barrier() dependency on __stop_machine()
> > > >
> > > > Currently, _rcu_barrier() relies on preempt_disable() to prevent
> > > > any CPU from going offline, which in turn depends on CPU hotplug's
> > > > use of __stop_machine().
> > > >
> > > > This patch therefore makes _rcu_barrier() use get_online_cpus() to
> > > > block CPU-hotplug operations. This has the added benefit of removing
> > > > the need for _rcu_barrier() to adopt callbacks: Because CPU-hotplug
> > > > operations are excluded, there can be no callbacks to adopt. This
> > > > commit simplifies the code accordingly.
> > > >
> > > > Signed-off-by: Paul E. McKenney <paul.mckenney@xxxxxxxxxx>
> > > > Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> > > > Reviewed-by: Josh Triplett <josh@xxxxxxxxxxxxxxxx>
> > > > ==
> > > >
> > > > is causing lockdep to complain (see the full trace below). I haven't yet
> > > > had time to analyze what exactly is happening, and probably will not have
> > > > time to do so until tomorrow, so just sending this as a heads-up in case
> > > > anyone sees the culprit immediately.
> > >
> > > Hmmm... Does the following patch help? It swaps the order in which
> > > rcu_barrier() acquires the hotplug and rcu_barrier locks.
> >
> > It changed the report slightly (see for example the change in possible
> > unsafe locking scenario, rcu_sched_state.barrier_mutex vanished and it's
> > now directly about cpu_hotplug.lock). With the patch applied I get
> >
> >
> >
> > ======================================================
> > [ INFO: possible circular locking dependency detected ]
> > 3.6.0-03888-g3f99f3b #145 Not tainted
>
> And it really seems valid.
>
> kmem_cache_destroy() calls rcu_barrier() with slab_mutex locked, which
> introduces slab_mutex -> cpu_hotplug.lock dependency (through
> rcu_barrier() -> _rcu_barrier() -> get_online_cpus()).
>
> On the other hand, _cpu_up() acquires cpu_hotplug.lock through
> cpu_hotplug_begin(), and with this lock held the cpuup_callback() notifier
> gets called, which acquires slab_mutex. This gives the reverse dependency,
> i.e. the deadlock scenario is a valid one.
>
> 1331e7a1bbe1f11b19c4327ba0853bee2a606543 is triggering this, because
> before that, there was no slab_mutex -> cpu_hotplug.lock dependency.
>
> Simply put, the commit causes get_online_cpus() to be called with
> slab_mutex held, which is invalid.

Oh, and it seems to actually trigger in practice.

With HEAD at 974a847e00c, the machine suspends nicely. With 974a847e00c +
your patch (which changes the order in which rcu_barrier() acquires the
hotplug and rcu_barrier locks), the machine hangs 100% reliably during
suspend, which is very likely the deadlock described above.
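
For illustration only, the inversion described above (slab_mutex ->
cpu_hotplug.lock via kmem_cache_destroy(), and cpu_hotplug.lock ->
slab_mutex via _cpu_up()) can be modelled with a tiny dependency graph.
This is a minimal userspace sketch, not lockdep's actual implementation;
the enum values and the helpers reachable() and record_acquire() are
hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock IDs for illustration; these are not kernel symbols. */
enum { SLAB_MUTEX, CPU_HOTPLUG_LOCK, NLOCKS };

/* dep[a][b] is true once lock b has been acquired while a was held. */
static bool dep[NLOCKS][NLOCKS];

/* Depth-first search: is `to` reachable from `from` via recorded edges? */
static bool reachable(int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NLOCKS; next++)
		if (dep[from][next] && !seen[next] && reachable(next, to, seen))
			return true;
	return false;
}

/*
 * Record the ordering "held -> taken". Returns false, without recording
 * the edge, if the reverse ordering is already known, i.e. if this
 * acquisition would close a cycle in the dependency graph.
 */
static bool record_acquire(int held, int taken)
{
	bool seen[NLOCKS] = { false };

	assert(held >= 0 && held < NLOCKS && taken >= 0 && taken < NLOCKS);
	if (reachable(taken, held, seen))
		return false;
	dep[held][taken] = true;
	return true;
}
```

Calling record_acquire(SLAB_MUTEX, CPU_HOTPLUG_LOCK) models the
kmem_cache_destroy() path and succeeds; a subsequent
record_acquire(CPU_HOTPLUG_LOCK, SLAB_MUTEX), modelling the _cpu_up()
path, is rejected because it would complete the cycle.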

--
Jiri Kosina
SUSE Labs