Re: lockdep circular locking error (rcu_node_level_0 vs rq->lock)

From: Paul E. McKenney
Date: Thu Jul 14 2011 - 00:29:38 EST


On Wed, Jul 13, 2011 at 09:33:09AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 13, 2011 at 10:24:08AM +0200, Peter Zijlstra wrote:
> > On Tue, 2011-07-12 at 15:54 -0700, Paul E. McKenney wrote:
> > > > @@ -892,6 +892,7 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
> > > > * prev into current:
> > > > */
> > > > spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
> > > > + rcu_read_acquire();
> > >
> > > Oooh... This is a tricky one. Hmmm...
> >
> > <snip>
> >
> > > Does any of this make sense?
> >
> > No?
> >
> > > > @@ -3141,6 +3170,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
> > > > */
> > > > #ifndef __ARCH_WANT_UNLOCKED_CTXSW
> > > > spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
> > > > + rcu_read_release();
> > >
> > > My guess is that we don't need the rcu_read_release() -- the arch shouldn't
> > > care that we have a non-atomic field in task_struct incremented, right?
> > >
> > > Or am I confused about what this is doing?
> >
> > Either that or I used the wrong primitives for what I was meaning to do.
> >
> > So from looking at rcupdate.h rcu_read_{acquire,release} are the lockdep
> > annotations of the rcu_read_lock. The thing we're doing above is context
> > switching and not making lockdep upset.
> >
> > The problem is that held lock state is tracked per task, and we're well
> > switching tasks, so we need to transfer the held lock state from the old
> > to the new task, since the new task will be unlocking the thing, if at
> > that time lockdep finds we're not actually holding it it screams bloody
> > murder.
> >
> > So what we do is we release the lock (annotation only, the actual lock
> > says locked) on prev before switching and acquire the lock on next right
> > after switching.
>
> OK, got it. I think, anyway.
>
> So preemptible RCU and context switches are handled as follows by
> your patch, correct?
>
> 1. The outgoing task is in an RCU read-side critical section
> due to the rcu_read_lock() that you added at the beginning
> of schedule().
>
> 2. In context_switch(), the lockdep state is set so that lockdep
> believes that the task has exited its RCU read-side critical
> section (thus suppressing the lockdep splat that would otherwise
> appear).
>
> 3. Once the context switch has happened, RCU no longer pays
> attention to the outgoing task. So the outgoing task's
> current->rcu_read_lock_nesting still indicates that the task is in
> an RCU read-side critical section, but RCU does not know or care.
>
> 4. If the incoming task has done at least one prior context switch,
> its current->rcu_read_lock_nesting will indicate that it is
> already in an RCU read-side critical section. Since there is
> no opportunity for RCU to report a quiescent state during the
> context switch itself (because holding the runqueue lock keeps
> interrupts disabled), RCU sees the whole context switch event
> as being one big RCU read-side critical section.
>
> If interrupts were enabled at any point, RCU might see this as
> two back-to-back read-side critical sections. This might be
> OK for all I know, but it would open the door more widely to
> heavyweight RCU processing at rcu_read_unlock() time.
>
> 5. Yes, context switch is a quiescent state, but that quiescent
> state happens when rcu_note_context_switch() is called from
> schedule, which is before the runqueue lock is acquired, and
> thus before the beginning of the RCU read-side critical section
> that covers the context switch.
>
> 6. In finish_lock_switch(), the rcu_read_acquire() restores
> lockdep state to match that of the incoming task.
>
> But if the task is newly created, then its value of
> current->rcu_read_lock_nesting will be zero, which will leave
> the incoming task unprotected by RCU. I believe that you
> need the (untested) patch below to make this work properly.
> But this patch needs to be included in your set, as I don't
> immediately see a reasonable way to make it work independently.
> (Yes, we could have a special CONFIG var for the transition,
> but it sounds a lot easier to just have it be part of your patch.)
>
> 7. Exit processing is done by the task itself before a last
> call to schedule(). From what I can see, it should not matter
> that the value of current->rcu_read_lock_nesting gets incremented
> on this last TASK_DEAD call to schedule().
>
> Or am I still confused?

Hmmm... Something is confused. I get boot time hangs with the
occasional stack overflow, whether or not I add my patch on top.
Left to myself, I would try applying your patch incrementally, and
also disabling irqs before rcu_read_lock() and enabling them after
rcu_read_unlock(), as this prevents rcu_read_unlock() from ever getting
to the rcu_read_unlock_special() slowpath.

Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/