Re: [PATCH tip/core/rcu 02/18] rcu: Move rcu_report_exp_rnp() to allow consolidation

From: Paul E. McKenney
Date: Wed Oct 07 2015 - 12:49:10 EST


On Wed, Oct 07, 2015 at 04:40:24PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 07, 2015 at 07:33:25AM -0700, Paul E. McKenney wrote:
> > > I'm sure you know what that means, but I've no clue ;-) That is, I
> > > wouldn't know where to start looking in the RCU implementation to verify
> > > the barrier is either needed or sufficient. Unless you mean _everywhere_
> > > :-)
> >
> > Pretty much everywhere.
> >
> > Let's take the usual RCU removal pattern as an example:
> >
> > void f1(struct foo *p)
> > {
> > list_del_rcu(p);
> > synchronize_rcu_expedited();
> > kfree(p);
> > }
> >
> > void f2(void)
> > {
> > struct foo *p;
> >
> > list_for_each_entry_rcu(p, &my_head, next)
> > do_something_with(p);
> > }
> >
> > So the synchronize_rcu_expedited() acts as an extremely heavyweight
> > memory barrier that pairs with the rcu_dereference() inside of
> > list_for_each_entry_rcu(). Easy enough, right?
> >
> > But what exactly within synchronize_rcu_expedited() provides the
> > ordering? The answer is a web of lock-based critical sections and
> > explicit memory barriers, with the one you called out as needing
> > a comment being one of them.
>
> Right, but seeing there's possible implementations of sync_rcu(_exp)*()
> that do not have the whole rcu_node tree like thing, there's more to
> this particular barrier than the semantics of sync_rcu().
>
> Some implementation choice requires this barrier upgrade -- and in
> another email I suggest its the whole tree thing, we need to firmly
> establish the state of one level before propagating the state up etc.
>
> Now I'm not entirely sure this is fully correct, but its the best I
> could come up.

It is pretty close. Ignoring dyntick idle for the moment, things
go (very) roughly like this:

o The RCU grace-period kthread notices that a new grace period
is needed. It initializes the tree, which includes acquiring
every rcu_node structure's ->lock.

o CPU A notices that there is a new grace period. It acquires
the ->lock of its leaf rcu_node structure, which forces full
ordering against the grace-period kthread.

o Some time later, that CPU A realizes that it has passed
through a quiescent state, and again acquires its leaf rcu_node
structure's ->lock, again enforcing full ordering, but this
time against all CPUs corresponding to this same leaf rcu_node
structure that previously noticed quiescent states for this
same grace period. Also against all prior readers on this
same CPU.

o Some time later, CPU B (corresponding to that same leaf
rcu_node structure) is the last of that leaf's group of CPUs
to notice a quiescent state. It has also acquired that leaf's
->lock, again forcing ordering against its prior RCU read-side
critical sections, but also against all the prior RCU
read-side critical sections of all other CPUs corresponding
to this same leaf.

o CPU B therefore moves up the tree, acquiring the parent
rcu_node structures' ->lock. In so doing, it forces full
ordering against all prior RCU read-side critical sections
of all CPUs corresponding to all leaf rcu_node structures
subordinate to the current (non-leaf) rcu_node structure.

o And so on, up the tree.

o When CPU C reaches the root of the tree, and realizes that
it is the last CPU to report a quiescent state for the
current grace period, its acquisition of the root rcu_node
structure's ->lock has forced full ordering against all
RCU read-side critical sections that started before this
grace period -- on all CPUs.

CPU C therefore awakens the grace-period kthread.

o When the grace-period kthread wakes up, it does cleanup,
which (you guessed it!) requires acquiring the ->lock of
each rcu_node structure. This not only forces full ordering
against each pre-existing RCU read-side critical section,
it also sets up things so that...

o When CPU D notices that the grace period ended, it does so
while holding its leaf rcu_node structure's ->lock. This
forces full ordering against all relevant RCU read-side
critical sections. This ordering prevails when CPU D later
starts invoking RCU callbacks.

o Just for fun, suppose that one of those callbacks does an
"smp_store_release(&leak_gp, 1)". Suppose further that some
CPU E that is not yet aware that the grace period is finished
does an "r1 = smp_load_acquire(&lead_gp)" and gets 1. Even
if CPU E was the very first CPU to report a quiescent state
for the grace period, and even if CPU E has not executed any
sort of ordering operations since, CPU E's subsequent code is
-still- guaranteed to be fully ordered after each and every
RCU read-side critical section that started before the grace
period.

Hey, you asked!!! ;-)

Again, this is a cartoon-like view of the ordering that leaves out a
lot of details, but it should get across the gist of the ordering.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/