Re: rcu: frequent rcu lockups
From: Paul E. McKenney
Date: Sun Mar 15 2015 - 16:43:17 EST
On Fri, Mar 13, 2015 at 10:39:43AM -0700, Paul E. McKenney wrote:
> On Thu, Mar 12, 2015 at 07:07:49AM -0700, Paul E. McKenney wrote:
> > On Thu, Mar 12, 2015 at 08:32:05AM -0400, Sasha Levin wrote:
> > > On 03/12/2015 08:28 AM, Sasha Levin wrote:
> > > > On 03/11/2015 07:16 PM, Paul E. McKenney wrote:
> > > >> > On Wed, Mar 11, 2015 at 07:06:40PM -0400, Sasha Levin wrote:
> > > >>>> >> > On 03/11/2015 07:01 PM, Paul E. McKenney wrote:
> > > >>>>>>>> >>>> > >> With the commit I didn't hit it yet, but I do see 4 different WARNings:
> > > >>>>>> >>> > > I wish that I could say that I am surprised, but the sad fact is that
> > > >>>>>> >>> > > I am still shaking the bugs out.
> > > >>>> >> >
> > > >>>> >> > I have one more to add:
> > > >>>> >> >
> > > >>>> >> > [ 93.330539] WARNING: CPU: 1 PID: 8 at kernel/rcu/tree_plugin.h:476 rcu_gp_kthread+0x1eaa/0x4dd0()
> > > >> > A bit different, but still in the class of a combining-tree bitmask
> > > >> > handling bug.
> > > > I left it overnight, and am still seeing hangs. Although (and don't catch me
> > > > by that) it seems to be significantly less of them.
> > >
> > > In one of the cases, KASan ended up complaining about something odd going on in
> > > rcu_check_callbacks():
> >
> > Hmmm... Looks like I have a synchronization blow between RCU CPU stall
> > warnings and task exit or some such. Promising clue, thank you!
>
> But I am not seeing it. There should be no synchronization issues
> with sched_show_task(current), as the current task cannot exit while
> it is calling sched_show_task(current). There is a remote-task
> sched_show_task(t), but it is invoked only on tasks that are on RCU's
> list of tasks that have blocked within an RCU read-side critical section
> during the current grace period. Such a task might exit while in in this
> critical section, but in that case exit_rcu() will take care of this, and
> it will acquire the same lock held across the call to sched_show_task(t).
>
> I suppose this could happen if someone re-enabled preemption late
> in the exit() code path and then entered an RCU read-side critical
> section, but that would be a bad breakage of the exit() path.
>
> Am I missing something here?
But I did find a bug that would result in the other warnings, and could
also result in too-short grace periods, which could in turn result in
arbitrarily arbitrary misbehavior. The patch below, which is also on
its way into -next, should fix this. Please let me know how it does
for you.
Thanx, Paul
------------------------------------------------------------------------
rcu: Associate quiescent-state reports with grace period
As noted in earlier commit logs, CPU hotplug operations running
concurrently with grace-period initialization can result in a given
leaf rcu_node structure having all CPUs offline and no blocked readers,
but with this rcu_node structure nevertheless blocking the current
grace period. Therefore, the quiescent-state forcing code now checks
for this situation and repairs it.
Unfortunately, this checking can result in false positives, for example,
when the last task has just removed itself from this leaf rcu_node
structure, but has not yet started clearing the ->qsmask bits further
up the structure. This means that the grace-period kthread (which
forces quiescent states) and some other task might be attempting to
concurrently clear these ->qsmask bits. This is usually not a problem:
One of these tasks will be the first to acquire the upper-level rcu_node
structure's lock and with therefore clear the bit, and the other task,
seeing the bit already cleared, will stop trying to clear bits.
Sadly, this means that the following unusual sequence of events -can-
result in a problem:
1. The grace-period kthread wins, and clears the ->qsmask bits.
2. This is the last thing blocking the current grace period, so
that the grace-period kthread clears ->qsmask bits all the way
to the root and finds that the root ->qsmask field is now zero.
3. Another grace period is required, so that the grace period kthread
initializes it, including setting all the needed qsmask bits.
4. The leaf rcu_node structure (the one that started this whole
mess) is blocking this new grace period, either because it
has at least one online CPU or because there is at least one
task that had blocked within an RCU read-side critical section
while running on one of this leaf rcu_node structure's CPUs.
(And yes, that CPU might well have gone offline before the
grace period in step (3) above started, which can mean that
there is a task on the leaf rcu_node structure's ->blkd_tasks
list, but ->qsmask equal to zero.)
5. The other kthread didn't get around to trying to clear the upper
level ->qsmask bits until all the above had happened. This means
that it now sees bits set in the upper-level ->qsmask field, so it
proceeds to clear them. Too bad that it is doing so on behalf of
a quiescent state that does not apply to the current grace period!
This sequence of events can result in the new grace period being too
short. It can also result in the new grace period ending before the
leaf rcu_node structure's ->qsmask bits have been cleared, which will
result in splats during initialization of the next grace period. In
addition, it can result in tasks blocking the new grace period still
being queued at the start of the next grace period, which will result
in other splats. Sasha's testing turned up another of these splats,
as did rcutorture testing. (And yes, rcutorture is being adjusted to
make these splats show up more quickly. Which probably is having the
undesirable side effect of making other problems show up less quickly.
Can't have everything!)
Reported-by: Sasha Levin <sasha.levin@xxxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 664b052f82a9..098046362b39 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2162,14 +2162,14 @@ static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
*/
static void
rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
- struct rcu_node *rnp, unsigned long flags)
+ struct rcu_node *rnp, unsigned long gps, unsigned long flags)
__releases(rnp->lock)
{
struct rcu_node *rnp_c;
/* Walk up the rcu_node hierarchy. */
for (;;) {
- if (!(rnp->qsmask & mask)) {
+ if (!(rnp->qsmask & mask) || rnp->gpnum != gps) {
/* Our bit has already been cleared, so done. */
raw_spin_unlock_irqrestore(&rnp->lock, flags);
@@ -2220,6 +2220,7 @@ static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
struct rcu_node *rnp, unsigned long flags)
__releases(rnp->lock)
{
+ unsigned long gps;
unsigned long mask;
struct rcu_node *rnp_p;
@@ -2240,11 +2241,12 @@ static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
}
/* Report up the rest of the hierarchy. */
+ gps = rnp->gpnum;
mask = rnp->grpmask;
raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
raw_spin_lock(&rnp_p->lock); /* irqs already disabled. */
smp_mb__after_unlock_lock();
- rcu_report_qs_rnp(mask, rsp, rnp_p, flags);
+ rcu_report_qs_rnp(mask, rsp, rnp_p, gps, flags);
}
/*
@@ -2295,7 +2297,8 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
*/
needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
- rcu_report_qs_rnp(mask, rsp, rnp, flags); /* rlses rnp->lock */
+ rcu_report_qs_rnp(mask, rsp, rnp, rnp->gpnum, flags);
+ /* ^^^ Released rnp->lock */
if (needwake)
rcu_gp_kthread_wake(rsp);
}
@@ -2754,8 +2757,8 @@ static void force_qs_rnp(struct rcu_state *rsp,
}
}
if (mask != 0) {
- /* Idle/offline CPUs, report. */
- rcu_report_qs_rnp(mask, rsp, rnp, flags);
+ /* Idle/offline CPUs, report (releases rnp->lock. */
+ rcu_report_qs_rnp(mask, rsp, rnp, rnp->gpnum, flags);
} else {
/* Nothing to do here, so just drop the lock. */
raw_spin_unlock_irqrestore(&rnp->lock, flags);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/