Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline

From: Peter Zijlstra
Date: Tue Jun 26 2018 - 16:32:58 EST


On Tue, Jun 26, 2018 at 01:26:15PM -0700, Paul E. McKenney wrote:
> commit 2e5b2ff4047b138d6b56e4e3ba91bc47503cdebe
> Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> Date: Fri May 25 19:23:09 2018 -0700
>
> rcu: Fix grace-period hangs due to race with CPU offline
>
> Without special fail-safe quiescent-state-propagation checks, grace-period
> hangs can result from the following scenario:
>
> 1. CPU 1 goes offline.
>
> 2. Because CPU 1 is the only CPU in the system blocking the current
> grace period, the grace period ends as soon as
> rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp()
> returns.

My current code doesn't have that call... So this must be a new problem,
introduced earlier in this series.

> 3. At this point, the leaf rcu_node structure's ->lock is no longer
> held: rcu_report_qs_rnp() has released it, as it must in order
> to awaken the RCU grace-period kthread.
>
> 4. At this point, that same leaf rcu_node structure's ->qsmaskinitnext
> field still records CPU 1 as being online. This is absolutely
> necessary because the scheduler uses RCU (in this case on the
> wake-up path while awakening RCU's grace-period kthread), and
> ->qsmaskinitnext contains RCU's idea as to which CPUs are online.
> Therefore, invoking rcu_report_qs_rnp() after clearing CPU 1's
> bit from ->qsmaskinitnext would result in a lockdep-RCU splat
> due to RCU being used from an offline CPU.

Argh.. so it's your own wakeup!

This all still smells really bad. But let me try and figure out where
you introduced the problem.
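
For anyone following along, the window in steps 1-4 of the changelog can be modelled in a few lines of user-space C. This is only a toy sketch, not kernel code: struct toy_rnp, toy_report_qs() and toy_offline_cpu() are made-up names, the "lock" is a plain flag, and the only things taken from the changelog are the ordering (report the quiescent state, drop ->lock for the wakeup, clear the ->qsmaskinitnext bit only afterwards) and the two field names.

```c
#include <assert.h>

/* Hypothetical, heavily simplified stand-in for a leaf rcu_node. */
struct toy_rnp {
	unsigned long qsmask;         /* CPUs still blocking the current GP */
	unsigned long qsmaskinitnext; /* RCU's idea of which CPUs are online */
	int locked;                   /* 1 while ->lock is "held" */
	int gp_ended;                 /* 1 once no CPU blocks the GP */
};

/*
 * Toy model of the quiescent-state report: clear the CPU's bit, note
 * whether the grace period can end, then drop the lock, as the real
 * function must before it can awaken the grace-period kthread.
 */
static void toy_report_qs(struct toy_rnp *rnp, unsigned long mask)
{
	assert(rnp->locked);
	rnp->qsmask &= ~mask;
	if (rnp->qsmask == 0)
		rnp->gp_ended = 1;
	rnp->locked = 0;
}

/*
 * Toy model of the offline path in steps 1-4. Returns 1 if, after the
 * report, the lock is no longer held and the grace period has ended
 * while the CPU is still recorded as online.
 */
static int toy_offline_cpu(struct toy_rnp *rnp, int cpu)
{
	unsigned long mask = 1UL << cpu;
	int window;

	rnp->locked = 1;
	toy_report_qs(rnp, mask);	/* steps 2 and 3 */

	/* Step 4: lock dropped, CPU still marked online. */
	window = !rnp->locked && rnp->gp_ended &&
		 (rnp->qsmaskinitnext & mask) != 0;

	/* Only now is the CPU's online bit cleared. */
	rnp->locked = 1;
	rnp->qsmaskinitnext &= ~mask;
	rnp->locked = 0;

	return window;
}
```

Calling toy_offline_cpu() on a node where CPU 1 is the only bit set in ->qsmask returns 1, which is the state steps 3 and 4 describe. The sketch says nothing about how that state turns into a hang; the quoted part of the changelog does not get that far.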