Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline

From: Paul E. McKenney
Date: Wed Jun 27 2018 - 11:41:25 EST


On Wed, Jun 27, 2018 at 10:33:35AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 04:40:04PM -0700, Paul E. McKenney wrote:
> > The options I have considered are as follows:
> >
> > 1.	Stick with the no-failsafe approach, adding the lock as shown
> >	in this patch. (I obviously prefer this approach.)
> >
> > 2.	Stick with the no-failsafe approach, but rely on RCU's grace-period
> >	kthread to wake up later due to its timed wait during the
> >	force-quiescent-state process. This would be a bit obnoxious,
> >	as it requires passing a don't-wake flag (or some such) up the
> >	quiescent-state reporting mechanism. It would also needlessly
> >	delay grace-period ends, especially on large systems (RCU scales
> >	up the FQS delay on larger systems to maintain limited CPU
> >	consumption per unit time).
> >
> > 3.	Stick with the no-failsafe approach, but have the quiescent-state
> >	reporting code hand back a value indicating that a wakeup is needed.
> >	Also a bit obnoxious, as this value would need to be threaded up
> >	the reporting code's return path. Simple in theory, but a bit
> >	of an ugly change, especially for the many places in the code that
> >	currently expect quiescent-state reporting to be an unconditional
> >	fire-and-forget operation.
>
> You can combine 2 and 3. Use a skip-wakeup flag and ignore the return
> value most of the time. Let me do that just to see how horrible it is.
>
> >
> > 4.	Re-introduce the old fail-safe code, and don't report the
> >	quiescent state at CPU-offline time, relying on the fail-safe
> >	code to do so. This also potentially introduces delays and can
> >	add extra FQS scans, which in turn increases lock contention.
> >
> > So what did you have in mind?
>
> The thing I talked about last night before crashing is the patch below.
> It does, however, suffer from a small false negative, much like the one
> you explained earlier: it allows @qsmaskinit to retain a set bit after
> the CPU goes offline.
>
> I had hoped to be able to clear @qsmaskinit unconditionally, but that
> doesn't quite work.

Yes, unless you are insanely careful (and possess an unusual tolerance for
complexity), you will end up with inconsistent ->qsmask fields, which will
get you too-short grace periods, grace-period hangs, or maybe even both.

For one thing, whatever code sets/clears a leaf rcu_node structure's
->qsmaskinit must propagate that change up the tree. If that code is
not grace-period initialization, then that code must somehow synchronize
correctly with grace-period initialization. For example, by introducing
a lock. ;-)
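
To make that propagation requirement concrete, here is a standalone
user-space sketch (emphatically not the kernel's code: only the field
names mirror the real rcu_node structure, and all locking is elided).
It clears a bit in a leaf's ->qsmaskinit and pushes the resulting
zero-ness up the tree, roughly in the style of rcu_cleanup_dead_rnp():

#include <stdio.h>

struct rnode {
	struct rnode *parent;
	unsigned long grpmask;		/* This node's bit in parent's masks. */
	unsigned long qsmaskinit;	/* Children/CPUs to wait on at GP start. */
};

/* Clear @mask in @rnp->qsmaskinit; if the node empties, continue upward. */
static void clear_and_propagate(struct rnode *rnp, unsigned long mask)
{
	for (;;) {
		rnp->qsmaskinit &= ~mask; /* Kernel would use WRITE_ONCE(). */
		if (rnp->qsmaskinit || !rnp->parent)
			return;		/* Still nonempty, or hit the root. */
		mask = rnp->grpmask;	/* Node now empty, so clear our bit */
		rnp = rnp->parent;	/* in the parent and repeat. */
	}
}

int main(void)
{
	struct rnode root = { .parent = NULL, .grpmask = 0, .qsmaskinit = 0x1 };
	struct rnode leaf = { .parent = &root, .grpmask = 0x1, .qsmaskinit = 0x3 };

	clear_and_propagate(&leaf, 0x1);	/* One CPU left: stops at leaf. */
	printf("leaf=%#lx root=%#lx\n", leaf.qsmaskinit, root.qsmaskinit);
	clear_and_propagate(&leaf, 0x2);	/* Leaf empties: root bit clears. */
	printf("leaf=%#lx root=%#lx\n", leaf.qsmaskinit, root.qsmaskinit);
	return 0;
}

In the kernel, each step of that loop of course runs under the
corresponding rcu_node structure's ->lock, which is exactly where the
synchronization with grace-period initialization has to come in.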

> The other approach is yet another mask, @qsmaskofflinenext, which the
> kthread will use to clear bits in @qsmaskinitnext.

And here I thought that my current use of only three such masks was
getting a bit ornate. ;-)

> In any case, aside from the above, the below adds a bunch of missing
> WRITE_ONCE()s. Since you read the various @qsmask variables using
> READ_ONCE(), you must also consistently update them using WRITE_ONCE();
> otherwise it's all still buggered.

And I introduced those READ_ONCE() calls an embarrassingly long time ago,
didn't I? But yes, any update needs to use WRITE_ONCE(). I will put
together a patch with your Reported-by. No, wait, I should instead start
with pieces of your patch below. To that end, may I have your
Signed-off-by for the WRITE_ONCE() pieces?
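
For anyone following along, here is a minimal user-space model of the
pairing in question. The macro definitions below are simplified
volatile-cast stand-ins; the kernel's versions in
include/linux/compiler.h are considerably more elaborate:

#include <stdio.h>

#define READ_ONCE(x)	 (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static unsigned long qsmask = 0xff;

/* Lockless reader: the volatile access forces one whole, untorn load. */
static unsigned long reader(void)
{
	return READ_ONCE(qsmask);
}

/*
 * Updater: a plain "qsmask &= ~mask" would let the compiler tear,
 * fuse, or re-issue the store behind the reader's back; pairing the
 * READ_ONCE() above with WRITE_ONCE() here rules that out.
 */
static void clear_bits(unsigned long mask)
{
	WRITE_ONCE(qsmask, qsmask & ~mask);
}

int main(void)
{
	clear_bits(0x0f);
	printf("qsmask = %#lx\n", reader());	/* Prints qsmask = 0xf0. */
	return 0;
}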

> ---
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 7832dd556490..8713048d5103 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -104,7 +104,6 @@ struct rcu_state sname##_state = { \
>  	.abbr = sabbr, \
>  	.exp_mutex = __MUTEX_INITIALIZER(sname##_state.exp_mutex), \
>  	.exp_wake_mutex = __MUTEX_INITIALIZER(sname##_state.exp_wake_mutex), \
> -	.ofl_lock = __SPIN_LOCK_UNLOCKED(sname##_state.ofl_lock), \
>  }
>
>  RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
> @@ -209,7 +208,12 @@ EXPORT_SYMBOL_GPL(rcu_get_gp_kthreads_prio);
>   */
>  unsigned long rcu_rnp_online_cpus(struct rcu_node *rnp)
>  {
> -	return READ_ONCE(rnp->qsmaskinitnext);
> +	/*
> +	 * For both online and offline we first set/clear @qsmaskinitnext,
> +	 * and complete by propagating into @qsmaskinit. As long as the bit
> +	 * remains in either mask, RCU is still online.
> +	 */
> +	return READ_ONCE(rnp->qsmaskinit) | READ_ONCE(rnp->qsmaskinitnext);
>  }
>
>  /*
> @@ -1928,19 +1932,17 @@ static bool rcu_gp_init(struct rcu_state *rsp)
>  	 */
>  	rsp->gp_state = RCU_GP_ONOFF;
>  	rcu_for_each_leaf_node(rsp, rnp) {
> -		spin_lock(&rsp->ofl_lock);
>  		raw_spin_lock_irq_rcu_node(rnp);
>  		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
>  		    !rnp->wait_blkd_tasks) {
>  			/* Nothing to do on this leaf rcu_node structure. */
>  			raw_spin_unlock_irq_rcu_node(rnp);
> -			spin_unlock(&rsp->ofl_lock);
>  			continue;
>  		}
>
>  		/* Record old state, apply changes to ->qsmaskinit field. */
>  		oldmask = rnp->qsmaskinit;
> -		rnp->qsmaskinit = rnp->qsmaskinitnext;
> +		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinitnext);
>
>  		/* If zero-ness of ->qsmaskinit changed, propagate up tree. */
>  		if (!oldmask != !rnp->qsmaskinit) {
> @@ -1970,7 +1972,6 @@ static bool rcu_gp_init(struct rcu_state *rsp)
>  		}
>
>  		raw_spin_unlock_irq_rcu_node(rnp);
> -		spin_unlock(&rsp->ofl_lock);
>  	}
>  	rcu_gp_slow(rsp, gp_preinit_delay); /* Races with CPU hotplug. */
>
> @@ -1992,7 +1993,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
>  		rdp = this_cpu_ptr(rsp->rda);
>  		rcu_preempt_check_blocked_tasks(rsp, rnp);
> -		rnp->qsmask = rnp->qsmaskinit;
> +		WRITE_ONCE(rnp->qsmask, rnp->qsmaskinit);
>  		WRITE_ONCE(rnp->gp_seq, rsp->gp_seq);
>  		if (rnp == rdp->mynode)
>  			(void)__note_gp_changes(rsp, rnp, rdp);
> @@ -2295,7 +2296,7 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
>  		WARN_ON_ONCE(oldmask); /* Any child must be all zeroed! */
>  		WARN_ON_ONCE(!rcu_is_leaf_node(rnp) &&
>  			     rcu_preempt_blocked_readers_cgp(rnp));
> -		rnp->qsmask &= ~mask;
> +		WRITE_ONCE(rnp->qsmask, rnp->qsmask & ~mask);
>  		trace_rcu_quiescent_state_report(rsp->name, rnp->gp_seq,
>  						 mask, rnp->qsmask, rnp->level,
>  						 rnp->grplo, rnp->grphi,
> @@ -2503,7 +2504,7 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
>  		if (!rnp)
>  			break;
>  		raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
> -		rnp->qsmaskinit &= ~mask;
> +		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinit & ~mask);
>  		/* Between grace periods, so better already be zero! */
>  		WARN_ON_ONCE(rnp->qsmask);
>  		if (rnp->qsmaskinit) {
> @@ -3522,7 +3523,7 @@ static void rcu_init_new_rnp(struct rcu_node *rnp_leaf)
>  			return;
>  		raw_spin_lock_rcu_node(rnp); /* Interrupts already disabled. */
>  		oldmask = rnp->qsmaskinit;
> -		rnp->qsmaskinit |= mask;
> +		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinit | mask);
>  		raw_spin_unlock_rcu_node(rnp); /* Interrupts remain disabled. */
>  		if (oldmask)
>  			return;
> @@ -3733,7 +3734,7 @@ void rcu_cpu_starting(unsigned int cpu)
>  		rnp = rdp->mynode;
>  		mask = rdp->grpmask;
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
> -		rnp->qsmaskinitnext |= mask;
> +		WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
>  		oldmask = rnp->expmaskinitnext;
>  		rnp->expmaskinitnext |= mask;
>  		oldmask ^= rnp->expmaskinitnext;
> @@ -3768,18 +3769,36 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
>
>  	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
>  	mask = rdp->grpmask;
> -	spin_lock(&rsp->ofl_lock);
>  	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
>  	rdp->rcu_ofl_gp_seq = READ_ONCE(rsp->gp_seq);
>  	rdp->rcu_ofl_gp_flags = READ_ONCE(rsp->gp_flags);
> +
> +	/*
> +	 * First clear @qsmaskinitnext such that we'll not start a new GP
> +	 * on this outgoing CPU.
> +	 */
> +	WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask);
>  	if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
> -		/* Report quiescent state -before- changing ->qsmaskinitnext! */
> +		/*
> +		 * Report the QS on the outgoing CPU. This will propagate the
> +		 * cleared bit into @qsmaskinit and @qsmask. We rely on
> +		 * @qsmaskinit still containing this CPU such that
> +		 * rcu_rnp_online_cpus() will still consider RCU online.
> +		 *
> +		 * This allows us to wake the GP kthread, since wakeups rely on
> +		 * RCU.
> +		 */
> +		WARN_ON_ONCE(!(rnp->qsmaskinit & mask));
>  		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
> +	} else {
> +		/*
> +		 * If there was no QS required, clear @qsmaskinit now to
> +		 * finalize the offline.
> +		 */
> +		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinit & ~mask);
>  	}
> -	rnp->qsmaskinitnext &= ~mask;
>  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> -	spin_unlock(&rsp->ofl_lock);
>  }
>
>  /*
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 4e74df768c57..a1528b970257 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -84,19 +84,24 @@ struct rcu_node {
>  	unsigned long gp_seq;	/* Track rsp->rcu_gp_seq. */
>  	unsigned long gp_seq_needed; /* Track rsp->rcu_gp_seq_needed. */
>  	unsigned long completedqs; /* All QSes done for this node. */
> -	unsigned long qsmask;	/* CPUs or groups that need to switch in */
> -				/* order for current grace period to proceed.*/
> -				/* In leaf rcu_node, each bit corresponds to */
> -				/* an rcu_data structure, otherwise, each */
> -				/* bit corresponds to a child rcu_node */
> -				/* structure. */
> -	unsigned long rcu_gp_init_mask;	/* Mask of offline CPUs at GP init. */
> +
> +	/*
> +	 * @qsmask - CPUs pending in this GP

Huh. I wasn't aware that docbook/sphinx/whatever knew about this
style of documentation. I will have to check that out in the fullness
of time...

> +	 * @qsmaskinit - CPUs we started this GP with

- CPUs online at the last GP start

> +	 * @qsmaskinitnext - CPUs we'll start the next GP with

Only if that GP were to start immediately, of course.

Note that if CPU 0 and CPU 88 come online in that order, it is quite
possible that there will be a grace period that waits on CPU 88 but
not on CPU 0. This would happen if grace-period initialization checked
CPU 0's leaf rcu_node structure's ->qsmaskinitnext, then CPU 0 came
online, then CPU 88 came online, and only then did initialization check
CPU 88's leaf rcu_node structure's ->qsmaskinitnext.

So the order in which CPUs come online and go offline is not necessarily
the order in which successive grace periods start/stop paying attention
to them.
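
A toy interleaving, for concreteness. This is plain user-space C,
nothing like the real code paths: the two structures stand in for the
leaf rcu_node structures holding CPU 0 and CPU 88, and the intervening
pass through ->qsmaskinit is elided:

#include <stdio.h>

struct leaf {
	unsigned long qsmaskinitnext;	/* CPUs online for the next GP. */
	unsigned long qsmask;		/* CPUs this GP still waits on. */
};

int main(void)
{
	struct leaf l0 = { 0, 0 };	/* Leaf holding CPU 0. */
	struct leaf l1 = { 0, 0 };	/* Leaf holding CPU 88. */

	l0.qsmask = l0.qsmaskinitnext;	/* GP init scans CPU 0's leaf, */
	l0.qsmaskinitnext |= 0x1;	/* then CPU 0 comes online, */
	l1.qsmaskinitnext |= 0x1;	/* then CPU 88 comes online, */
	l1.qsmask = l1.qsmaskinitnext;	/* then GP init scans CPU 88's leaf. */

	/* This GP waits on late-arriving CPU 88 but not on CPU 0. */
	printf("waits on CPU 0: %s, on CPU 88: %s\n",
	       l0.qsmask ? "yes" : "no", l1.qsmask ? "yes" : "no");
	return 0;
}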

							Thanx, Paul

> +	 *
> +	 * online: we add the incoming CPU to @qsmaskinitnext which will then
> +	 * be propagated into @qsmaskinit and @qsmask by starting/joining a GP.
> +	 *
> +	 * offline: we remove the CPU from @qsmaskinitnext such that the
> +	 * outgoing CPU will not be part of a next GP, which will then be
> +	 * propagated into @qsmaskinit and @qsmask by completing/leaving a GP.
> +	 */
> +	unsigned long qsmask;
>  	unsigned long qsmaskinit;
> -				/* Per-GP initial value for qsmask. */
> -				/* Initialized from ->qsmaskinitnext at the */
> -				/* beginning of each grace period. */
>  	unsigned long qsmaskinitnext;
> -				/* Online CPUs for next grace period. */
> +
> +	unsigned long rcu_gp_init_mask;	/* Mask of offline CPUs at GP init. */
>  	unsigned long expmask;	/* CPUs or groups that need to check in */
>  				/* to allow the current expedited GP */
>  				/* to complete. */
> @@ -367,10 +372,6 @@ struct rcu_state {
>  	const char *name;	/* Name of structure. */
>  	char abbr;		/* Abbreviated name. */
>  	struct list_head flavors; /* List of RCU flavors. */
> -
> -	spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
> -						/* Synchronize offline with */
> -						/* GP pre-initialization. */
>  };
>
>  /* Values for rcu_state structure's gp_flags field. */