Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline

From: Paul E. McKenney
Date: Tue Jun 26 2018 - 19:38:11 EST


On Tue, Jun 26, 2018 at 10:32:25PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 01:26:15PM -0700, Paul E. McKenney wrote:
> > commit 2e5b2ff4047b138d6b56e4e3ba91bc47503cdebe
> > Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> > Date: Fri May 25 19:23:09 2018 -0700
> >
> > rcu: Fix grace-period hangs due to race with CPU offline
> >
> > Without special fail-safe quiescent-state-propagation checks, grace-period
> > hangs can result from the following scenario:
> >
> > 1. CPU 1 goes offline.
> >
> > 2. Because CPU 1 is the only CPU in the system blocking the current
> > grace period, the grace period ends as soon as
> > rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp()
> > returns.
>
> My current code doesn't have that call... So this is a new problem
> earlier in this series.
>
> > 3. At this point, the leaf rcu_node structure's ->lock is no longer
> > held: rcu_report_qs_rnp() has released it, as it must in order
> > to awaken the RCU grace-period kthread.
> >
> > 4. At this point, that same leaf rcu_node structure's ->qsmaskinitnext
> > field still records CPU 1 as being online. This is absolutely
> > necessary because the scheduler uses RCU (in this case on the
> > wake-up path while awakening RCU's grace-period kthread), and
> > ->qsmaskinitnext contains RCU's idea as to which CPUs are online.
> > Therefore, invoking rcu_report_qs_rnp() after clearing CPU 1's
> > bit from ->qsmaskinitnext would result in a lockdep-RCU splat
> > due to RCU being used from an offline CPU.
>
> Argh.. so it's your own wakeup!

That it is. The outgoing CPU might need to wake up RCU's grace-period
kthread, for example, if that CPU was the last thing holding up the
current grace period.

> This all still smells really bad. But let me try and figure out where
> you introduced the problem.

Well, I can't say that it is the code that I am the most proud of...

On the other hand, there are a number of constraints on this code:

o The ->qsmask bits indicate which CPUs the current grace period
is still waiting on.

o The ->qsmaskinit bits indicate which CPUs the current grace
period was waiting on at grace-period initialization time.
In fact, the ->qsmask bits are initialized by copying from
the corresponding ->qsmaskinit bits.

o The ->qsmaskinitnext bits indicate which CPUs are online at
a given time, from an RCU perspective. RCU will consider
an incoming CPU to be online before cpu_online_mask does,
and RCU will also consider an outgoing CPU to be online after
cpu_online_mask says that it is gone.

o Only the leaf rcu_node structures' ->qsmaskinitnext bits have
meaning. Thus, although the ->qsmaskinit bits are initialized
for a given grace period based on changes in the values of
the ->qsmaskinitnext bits since the beginning of the last
grace period, for the non-leaf ->qsmaskinit bits, this cannot
be a simple copy operation. (Instead, changes are pushed up
the rcu_node tree.)

Why not instead make the CPU-hotplug operations push the
changes up the tree? Because this could race with grace-period
initialization, in which case it would result in inconsistent
values for the ->qsmaskinit bits, and thus of the ->qsmask bits.
These inconsistent values can result in too-short grace periods
on the one hand and grace-period hangs on the other.

o The purpose of these three bitmasks is to decouple RCU's
grace-period initialization from CPU-hotplug operations.
The reason for this decoupling is that various pieces of the
CPU-hotplug code wait on RCU grace periods, which means that
RCU grace-period initialization cannot block CPU hotplug,
at least not without risk of deadlock.

o The initial state of all the ->qsmask bits must be consistent (at
least in the case where no CPU tries to report a quiescent state
until all these bits are fully initialized). That is to say,
if the initial value of a given ->qsmask is all zero (and there
are no preempted tasks referenced by the ->gp_tasks pointer,
see below), then the initial value of the corresponding bit of
the parent rcu_node structure must also be zero. Similarly, if
the initial value of a given ->qsmask has at least one non-zero
bit (or there is at least one task referenced by the ->gp_tasks
pointer), then the initial value of the corresponding bit of
the parent rcu_node structure must also be non-zero. Of course,

Violation of this rule was the impetus for several of the
patches under discussion.

o Only the CPU-hotplug code is allowed to change the values of
the ->qsmaskinitnext bits. The grace-period initialization code
reads these bits, but so does code checking for offline CPUs.
The rcu_node structure's ->lock must be held when changing
these bits.

o Only the grace-period initialization code is allowed to access the
->qsmaskinit bits (give or take debug prints). In theory, no lock
is required when updating these bits, but in practice the rcu_node
structure's ->lock is held due to the need to hold either the
->qsmask or the ->qsmaskinitnext bits still while carrying out
the access.

o Only the grace-period initialization code is allowed to change
the values of the ->qsmask bits from zero to one. Only something
reporting a quiescent state is allowed to change the values of
the corresponding ->qsmask bits from one to zero. If the
->qsmask field of the root rcu_node structure becomes zero,
the RCU grace-period kthread should be awakened (one execption
being when that kthread cleared the last bit, and another where
the root is also a leaf and there are tasks referenced by the
->gp_tasks pointer, again, see below).

o Grace-period initialization first scans the leaf rcu_node
structures for CPU-hotplug transitions (locking each such
structure along the way and updating their ->qsmaskinit field),
then does a breadth-first copy of ->qsmaskinit to ->qsmask for
each rcu_node structure. CPUs can come and go at any point
in this process. In fact, one CPU can go and another
(or even the same CPU) can come back during this part of
grace-period initialization.

o Leaf rcu_node structures must concern themselves with tasks
preempted within an RCU-preempt read-side critical section as well
as CPUs passing through quiescent states. These preempted tasks
are placed on the corresponding structure's ->blkd_tasks list.
It is possible that a grace period might start such that a given
leaf must wait on a preempted task (non-NULL ->gp_tasks pointer)
but no online CPUs. In addition, a CPU could come online
while this state was in effect. The boolean ->wait_blkd_tasks
field helps keep all this straight, allowing that leaf rcu_node
structure to be taken out of circulation once all its CPUs are
offline -and- its ->blkd_tasks list has become empty.

o As soon as a ->qsmask bit is set in a leaf rcu_node structure,
the CPU corresponding to that bit might immediately start trying
to report a quiescent state. This means that the ->qsmask bits
must be initialized from the root down, in a breadth-first
fashion. Otherwise, a quiescent-state report could cause the
->qsmask bits to become inconsistent (if the report outran
the bottom-to-top initialization).

o In the past, the RCU grace-period kthread's force-quiescent-state
scan would catch CPU-offline races, but the goal is to get rid
of that fail-safe in order to reduce lock contention. Therefore,
if a given leaf rcu_node structure's ->qsmask field has a bit set,
then that CPU needs to be online at least at the very beginning
of the grace period.

The options I have considered are as follows:

1. Stick with the no-failsafe approach, adding the lock as shown
in this patch. (I obviously prefer this approach.)

2. Stick with the no-failsafe approach, but rely on RCU's grace-period
kthread to wake up later due to its timed wait during the
force-quiescent-state process. This would be a bit obnoxious,
as it requires passing a don't-wake flag (or some such) up the
quiescent-state reporting mechanism. It would also needlessly
delay grace-period ends, especially on large systems (RCU scales
up the FQS delay on larger systems to maintain limited CPU
consumption per unit time).

3. Stick with the no-failsafe approach, but have the quiescent-state
reporting code hand back a value indicating that a wakeup is needed.
Also a bit obnoxious, as this value would need to be threaded up
the reporting code's return path. Simple in theory, but a bit
of an ugly change, especially for the many places in the code that
currently expect quiescent-state reporting to be an unconditional
fire-and-forget operation.

4. Re-introduce the old fail-safe code, and don't report the
quiescent state at CPU-offline time, relying on the fail-safe
code to do so. This also potentially introduces delays and can
add extra FQS scans, which in turn increases lock contention.

So what did you have in mind?

Thanx, Paul