Re: [PATCH 5/8] rcu: eliminate deadlock for rcu read site

From: Paul E. McKenney
Date: Fri Aug 09 2013 - 13:59:32 EST


On Fri, Aug 09, 2013 at 05:31:27PM +0800, Lai Jiangshan wrote:
> On 08/09/2013 04:40 AM, Paul E. McKenney wrote:
> > On Wed, Aug 07, 2013 at 06:25:01PM +0800, Lai Jiangshan wrote:
> >> Background)
> >>
> >> Although all articles declare that rcu read site is deadlock-immunity.
> >> It is not true for rcu-preempt, it will be deadlock if rcu read site
> >> overlaps with scheduler lock.
> >>
> >> ec433f0c, 10f39bb1 and 016a8d5b just partially solve it. But rcu read site
> >> is still not deadlock-immunity. And the problem described in 016a8d5b
> >> is still existed(rcu_read_unlock_special() calls wake_up).
> >>
> >> Aim)
> >>
> >> We want to fix the problem forever, we want to keep rcu read site
> >> is deadlock-immunity as books say.
> >>
> >> How)
> >>
> >> The problem is solved by "if rcu_read_unlock_special() is called inside
> >> any lock which can be (chained) nested in rcu_read_unlock_special(),
> >> we defer rcu_read_unlock_special()".
> >> This kind locks include rnp->lock, scheduler locks, perf ctx->lock, locks
> >> in printk()/WARN_ON() and all locks nested in these locks or chained nested
> >> in these locks.
> >>
> >> The problem is reduced to "how to distinguish all these locks(context)",
> >> We don't distinguish all these locks, we know that all these locks
> >> should be nested in local_irqs_disable().
> >>
> >> we just consider if rcu_read_unlock_special() is called in irqs-disabled
> >> context, it may be called in these suspect locks, we should defer
> >> rcu_read_unlock_special().
> >>
> >> The algorithm enlarges the probability of deferring, but the probability
> >> is still very very low.
> >>
> >> Deferring does add a small overhead, but it offers us:
> >> 1) really deadlock-immunity for rcu read site
> >> 2) remove the overhead of the irq-work(250 times per second in avg.)
> >
> > One problem here -- it may take quite some time for a set_need_resched()
> > to take effect. This is especially a problem for RCU priority boosting,
> > but can also needlessly delay preemptible-RCU grace periods because
> > local_irq_restore() and friends don't check the TIF_NEED_RESCHED bit.
>
> The final effect of deboosting(rt_mutex_unlock()) is also accomplished
> via set_need_resched()/set_tsk_need_resched().
> set_need_resched() is enough for RCU priority boosting issue here.

Eventually, yes. But all that set_need_resched() does is set the
TIF_NEED_RESCHED. This is checked by the outermost preempt_enable(),
return from interrupt, return to userspace, and things like
cond_resched(). So it might well be quite some time until the boosted
reader actually gets around to deboosting itself.

> Since rcu_read_unlock_special() is deferred, it does take quite some time for
> QS report to take effect.

Agreed.

> > OK, alternatives...
> >
> > o Keep the current rule saying that if the scheduler is going
> > to exit an RCU read-side critical section while holding
> > one of its spinlocks, preemption has to have been disabled
>
> Since rtmutex'lock->wait_lock is not irqs-disabled nor bh-disabled.

Yep, because this rule prevents the call to rt_mutex_unlock() from
happening whenever one of the scheduler's irq-disable locks is held.

> This kind of spinlocks include scheduler locks, rtmutex'lock->wait_lock,
> all locks can be acquired in irq/SOFTIRQ.
>
> So this rule is not only applied for scheduler locks, it should also
> be applied for almost all spinlocks in the kernel.

No, only those locks acquired by the scheduler in the wakeup path.
Or am I missing something here?

> I was hard to accept that rcu read site is not deadlock-immunity.

So did I, see http://lwn.net/Articles/453002/. RCU was a victim of its
own success. ;-)

And it would be really cool to restore full deadlock immunity to
rcu_read_unlock(), no two ways about it! Hmmm... Any way that a
self-IPI could be made safe for all architectures?

Thanx, Paul

> Thanks,
> Lai
>
> > throughout the full duration of that critical section.
> > Well, we can certainly do this, but it would be nice to get
> > rid of this rule.
> >
> > o Use per-CPU variables, possibly injecting delay. This has ugly
> > disadvantages as noted above.
> >
> > o irq_work_queue() can wait a jiffy (or on some architectures,
> > quite a bit longer) before actually doing anything.
> >
> > o raise_softirq() is more immediate and is an easy change, but
> > adds a softirq vector -- which people are really trying to
> > get rid of. Also, wakeup_softirqd() calls things that acquire
> > the scheduler locks, which is exactly what we were trying to
> > avoid doing.
> >
> > o invoke_rcu_core() can invoke raise_softirq() as above.
> >
> > o IPI to self. From what I can see, not all architectures
> > support this. Easy to fake if you have at least two CPUs,
> > but not so good from an OS jitter viewpoint...
> >
> > o Add a check to local_irq_disable() and friends. I would guess
> > that this suggestion would not make architecture maintainers
> > happy.
> >
> > Other thoughts?
> >
> > Thanx, Paul
> >
> >> Signed-off-by: Lai Jiangshan <laijs@xxxxxxxxxxxxxx>
> >> ---
> >> include/linux/rcupdate.h | 2 +-
> >> kernel/rcupdate.c | 2 +-
> >> kernel/rcutree_plugin.h | 47 +++++++++++++++++++++++++++++++++++++++++----
> >> 3 files changed, 44 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> >> index 4b14bdc..00b4220 100644
> >> --- a/include/linux/rcupdate.h
> >> +++ b/include/linux/rcupdate.h
> >> @@ -180,7 +180,7 @@ extern void synchronize_sched(void);
> >>
> >> extern void __rcu_read_lock(void);
> >> extern void __rcu_read_unlock(void);
> >> -extern void rcu_read_unlock_special(struct task_struct *t);
> >> +extern void rcu_read_unlock_special(struct task_struct *t, bool unlock);
> >> void synchronize_rcu(void);
> >>
> >> /*
> >> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> >> index cce6ba8..33b89a3 100644
> >> --- a/kernel/rcupdate.c
> >> +++ b/kernel/rcupdate.c
> >> @@ -90,7 +90,7 @@ void __rcu_read_unlock(void)
> >> #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
> >> barrier(); /* assign before ->rcu_read_unlock_special load */
> >> if (unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
> >> - rcu_read_unlock_special(t);
> >> + rcu_read_unlock_special(t, true);
> >> barrier(); /* ->rcu_read_unlock_special load before assign */
> >> t->rcu_read_lock_nesting = 0;
> >> }
> >> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> >> index fc8b36f..997b424 100644
> >> --- a/kernel/rcutree_plugin.h
> >> +++ b/kernel/rcutree_plugin.h
> >> @@ -242,15 +242,16 @@ static void rcu_preempt_note_context_switch(int cpu)
> >> ? rnp->gpnum
> >> : rnp->gpnum + 1);
> >> raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >> - } else if (t->rcu_read_lock_nesting < 0 &&
> >> - !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN) &&
> >> - t->rcu_read_unlock_special) {
> >> + } else if (t->rcu_read_lock_nesting == 0 ||
> >> + (t->rcu_read_lock_nesting < 0 &&
> >> + !WARN_ON_ONCE(t->rcu_read_lock_nesting != INT_MIN))) {
> >>
> >> /*
> >> * Complete exit from RCU read-side critical section on
> >> * behalf of preempted instance of __rcu_read_unlock().
> >> */
> >> - rcu_read_unlock_special(t);
> >> + if (t->rcu_read_unlock_special)
> >> + rcu_read_unlock_special(t, false);
> >> }
> >>
> >> /*
> >> @@ -333,7 +334,7 @@ static struct list_head *rcu_next_node_entry(struct task_struct *t,
> >> * notify RCU core processing or task having blocked during the RCU
> >> * read-side critical section.
> >> */
> >> -void rcu_read_unlock_special(struct task_struct *t)
> >> +void rcu_read_unlock_special(struct task_struct *t, bool unlock)
> >> {
> >> int empty;
> >> int empty_exp;
> >> @@ -364,6 +365,42 @@ void rcu_read_unlock_special(struct task_struct *t)
> >>
> >> /* Clean up if blocked during RCU read-side critical section. */
> >> if (special & RCU_READ_UNLOCK_BLOCKED) {
> >> + /*
> >> + * If rcu read lock overlaps with scheduler lock,
> >> + * rcu_read_unlock_special() may lead to deadlock:
> >> + *
> >> + * rcu_read_lock();
> >> + * preempt_schedule[_irq]() (when preemption)
> >> + * scheduler lock; (or some other locks can be (chained) nested
> >> + * in rcu_read_unlock_special()/rnp->lock)
> >> + * access and check rcu data
> >> + * rcu_read_unlock();
> >> + * rcu_read_unlock_special();
> >> + * wake_up(); DEAD LOCK
> >> + *
> >> + * To avoid all these kinds of deadlock, we should quit
> >> + * rcu_read_unlock_special() here and defer it to
> >> + * rcu_preempt_note_context_switch() or next outmost
> >> + * rcu_read_unlock() if we consider this case may happen.
> >> + *
> >> + * Although we can't know whether current _special()
> >> + * is nested in scheduler lock or not. But we know that
> >> + * irqs are always disabled in this case. so we just quit
> >> + * and defer it to rcu_preempt_note_context_switch()
> >> + * when irqs are disabled.
> >> + *
> >> + * It means we always defer _special() when it is
> >> + * nested in irqs disabled context, but
> >> + * (special & RCU_READ_UNLOCK_BLOCKED) &&
> >> + * irqs_disabled_flags(flags)
> >> + * is still unlikely to be true.
> >> + */
> >> + if (unlikely(unlock && irqs_disabled_flags(flags))) {
> >> + set_need_resched();
> >> + local_irq_restore(flags);
> >> + return;
> >> + }
> >> +
> >> t->rcu_read_unlock_special &= ~RCU_READ_UNLOCK_BLOCKED;
> >>
> >> /*
> >> --
> >> 1.7.4.4
> >>
> >
> >
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/