Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

From: Paul E. McKenney
Date: Thu Oct 05 2017 - 12:19:20 EST


On Thu, Oct 05, 2017 at 05:39:13PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 05, 2017 at 07:55:13AM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 05, 2017 at 11:41:14AM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 04, 2017 at 02:29:27PM -0700, Paul E. McKenney wrote:
> > > > Consider the following admittedly improbable sequence of events:
> > > >
> > > > o RCU is initially idle.
> > > >
> > > > o Task A on CPU 0 executes rcu_read_lock().
> > > >
> > > > o Task B on CPU 1 executes synchronize_rcu(), which must
> > > > wait on Task A:
> > > >
> > > > o Task B registers the callback, which starts a new
> > > > grace period, awakening the grace-period kthread
> > > > on CPU 3, which immediately starts a new grace period.
> > > >
> > > > o Task B migrates to CPU 2, which provides a quiescent
> > > > state for both CPUs 1 and 2.
> > > >
> > > > o Both CPUs 1 and 2 take scheduling-clock interrupts,
> > > > and both invoke RCU_SOFTIRQ, both thus learning of the
> > > > new grace period.
> > > >
> > > > o Task B is delayed, perhaps by vCPU preemption on CPU 2.
> > > >
> > > > o CPUs 2 and 3 pass through quiescent states, which are reported
> > > > to core RCU.
> > > >
> > > > o Task B is resumed just long enough to be migrated to CPU 3,
> > > > and then is once again delayed.
> > > >
> > > > o Task A executes rcu_read_unlock(), exiting its RCU read-side
> > > > critical section.
> > > >
> > > > o CPU 0 passes through a quiescent state, which is reported to
> > > > core RCU. Only CPU 1 continues to block the grace period.
> > > >
> > > > o CPU 1 passes through a quiescent state, which is reported to
> > > > core RCU. This ends the grace period, and CPU 1 therefore
> > > > invokes its callbacks, one of which awakens Task B via
> > > > complete().
> > > >
> > > > o Task B resumes (still on CPU 3) and starts executing
> > > > wait_for_completion(), which sees that the completion has
> > > > already completed, and thus does not block. It returns from
> > > > the synchronize_rcu() without any ordering against the
> > > > end of Task A's RCU read-side critical section.
> > > >
> > > > It can therefore mess up Task A's RCU read-side critical section,
> > > > in theory, anyway.
> > >
> > > I'm not sure I follow, at the very least the wait_for_completion() does
> > > an ACQUIRE such that it observes the state prior to the RELEASE as done
> > > by complete(), no?
> >
> > Your point being that both wait_for_completion() and complete() acquire
> > and release the same lock? (Yes, I suspect that I was confusing this
> > with wait_event() and wake_up(), just so you know.)
>
> Well, fundamentally complete()/wait_for_completion() is a message-pass
> and they include a RELEASE/ACQUIRE pair for causal reasons.
>
> Per the implementation they use a spinlock, but any implementation needs
> to provide at least that RELEASE/ACQUIRE pair.
>
> > > And is not CPU0's QS reporting ordered against that complete()?
> >
> > Mumble mumble mumble powerpc mumble mumble mumble...
> >
> > OK, I will make this new memory barrier only execute for powerpc.
> >
> > Or am I missing something else here?
>
> So I'm not entirely clear on the required semantics here; why do we need
> a full mb? I'm thinking CPU0's QS propagating through the tree and
> arriving at the root node is a multi-copy-atomic / transitive thing and
> all CPUs will agree the system QS has ended, right?
>
> Whichever CPU establishes the system QS does complete() and the
> wait_for_completion() then has the weak-transitive causal relation to
> that, ensuring that -- in the above example -- CPU3 must be _after_
> CPU0's rcu_read_unlock().

Yes, the ordering does need to be visible to uninvolved CPUs, so
release-acquire is not necessarily strong enough.

My current thought is like this:

	if (IS_ENABLED(CONFIG_ARCH_WEAK_RELEASE_ACQUIRE))
		smp_mb();
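
For concreteness, here is a rough userspace sketch of that shape, using C11 atomics and pthreads in place of the kernel primitives. It is an illustration under stated assumptions, not the proposed patch: WEAK_RELEASE_ACQUIRE is a hypothetical stand-in for the IS_ENABLED() test, the release store and acquire load stand in for complete() and wait_for_completion(), and run_message_pass() is a made-up helper name. With only two threads it shows the message-pass itself; it cannot demonstrate the uninvolved-CPU case that the extra full barrier is meant to cover.

```c
/* Userspace sketch only: C11 atomics and pthreads stand in for kernel primitives. */
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for IS_ENABLED(CONFIG_ARCH_WEAK_RELEASE_ACQUIRE). */
#define WEAK_RELEASE_ACQUIRE 1

static int payload;	/* plain data written before the "complete()" */
static atomic_int done;	/* plays the role of the completion */

/* Plays the role of the CPU that ends the grace period and calls complete(). */
static void *completer(void *arg)
{
	(void)arg;
	payload = 42;
	atomic_store_explicit(&done, 1, memory_order_release);
	return NULL;
}

/* Plays the role of Task B in wait_for_completion(); returns what it saw. */
static int waiter(void)
{
	while (!atomic_load_explicit(&done, memory_order_acquire))
		;	/* spin until the release store is visible */
	if (WEAK_RELEASE_ACQUIRE)
		atomic_thread_fence(memory_order_seq_cst);	/* analogous to smp_mb() */
	return payload;
}

/* Runs one completer/waiter pair; returns the observed payload, or -1 on error. */
int run_message_pass(void)
{
	pthread_t t;
	int seen;

	payload = 0;
	atomic_store(&done, 0);
	if (pthread_create(&t, NULL, completer, NULL))
		return -1;
	seen = waiter();
	pthread_join(t, NULL);
	return seen;
}
```

The acquire load already guarantees that the waiter sees the payload written before the release store; the conditional fence only adds the stronger ordering on top of that.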

Thoughts?

Thanx, Paul