Re: [PATCH tip/core/rcu 1/9] rcu: Provide GP ordering in face of migrations and delays

From: Paul E. McKenney
Date: Sat Oct 07 2017 - 14:32:27 EST

On Sat, Oct 07, 2017 at 11:29:19AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 06, 2017 at 08:31:05PM -0700, Paul E. McKenney wrote:
> > > > OK, I will bite... What do the smp_store_release() and the
> > > > smp_load_acquire() correspond to? I see just plain locking in
> > > > wait_for_completion() and complete().
> > >
> > > They reflect the concept of complete() / wait_for_completion().
> > > Fundamentally all it needs to do is pass the message of 'completion'.
> > >
> > > That is, if we were to go optimize our completion implementation, it
> > > would be impossible to be weaker than this and still correct.
> >
> > OK, though the model does not provide spinlocks, and there can be

Sigh. s/not//. The current model -does- provide spinlocks, though
they are a bit new. I don't know of any breakage, but I am paranoid
enough that, where feasible, I double-check against xchg_acquire()
and smp_store_release().

> > differences in behavior between spinlocks and release-acquire.
> > But yes, in this case, it works.
> Sure; but the fundamental property here is that if we observe the
> complete() we must also observe everything that went before. The exact
> means of implementing that is irrelevant.

Agreed, and that also tends to speed up running the model on the
litmus test, so this sort of abstraction is a very good thing for
multiple reasons.
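To make Peter's point concrete, here is a minimal userspace sketch, assuming C11 <stdatomic.h> and POSIX threads rather than kernel primitives. The names my_completion, my_complete(), my_wait_for_completion() and completion_demo() are hypothetical, and spinning with sched_yield() merely stands in for sleeping. The only ordering is one release store paired with one acquire load, which per Peter's argument is about as weak as a correct completion can get:

```c
/* Userspace sketch, NOT the kernel's completion implementation. */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

struct my_completion {
	atomic_int done;
};

static void my_complete(struct my_completion *c)
{
	/* Release: all writes before this store become visible to any
	 * thread whose acquire load observes done == 1. */
	atomic_store_explicit(&c->done, 1, memory_order_release);
}

static void my_wait_for_completion(struct my_completion *c)
{
	/* Acquire: once done == 1 is seen, the waker's prior writes are
	 * seen too. sched_yield() is a stand-in for actually sleeping. */
	while (!atomic_load_explicit(&c->done, memory_order_acquire))
		sched_yield();
}

static struct my_completion comp;
static int payload;	/* plain, non-atomic data published by the waker */

static void *waker(void *arg)
{
	(void)arg;
	payload = 42;		/* written before complete() */
	my_complete(&comp);
	return NULL;
}

/* Returns what the waiter observed in payload. */
int completion_demo(void)
{
	pthread_t t;

	atomic_init(&comp.done, 0);
	payload = 0;
	pthread_create(&t, NULL, waker, NULL);
	my_wait_for_completion(&comp);
	assert(payload == 42);	/* guaranteed by release/acquire */
	pthread_join(t, NULL);
	return payload;
}
```

If my_complete() used memory_order_relaxed instead, nothing would order the write to payload before the flag, and the waiter's read of payload would be a data race.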

So why did I use spinlocks? Because the model was small and fast enough,
and using the spinlocks meant that I didn't need to take time to worry
about the code's intent.
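For comparison, here is a toy spinlock in the same userspace setting: lock is an acquire exchange (roughly what xchg_acquire() gives) and unlock is a release store (roughly smp_store_release()). This is a hypothetical illustration of that kind of cross-check, not how the kernel or the memory model actually defines spinlocks; toy_spinlock_t, toy_spin_lock(), toy_spin_unlock() and spinlock_demo() are made-up names. Passing the "completion" message through a lock-protected flag still guarantees that the waiter sees the waker's earlier write to payload:

```c
/* Userspace sketch; the toy_* names are hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
	atomic_int locked;
} toy_spinlock_t;

static void toy_spin_lock(toy_spinlock_t *l)
{
	/* Acquire exchange: loop until we are the one that swapped 0 -> 1. */
	while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
		sched_yield();
}

static void toy_spin_unlock(toy_spinlock_t *l)
{
	/* Release store: publishes everything done while holding the lock. */
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}

static toy_spinlock_t lock;
static int done;	/* protected by lock */
static int payload;	/* written before the critical section that sets done */

static void *waker(void *arg)
{
	(void)arg;
	payload = 42;
	toy_spin_lock(&lock);
	done = 1;
	toy_spin_unlock(&lock);
	return NULL;
}

int spinlock_demo(void)
{
	pthread_t t;
	bool seen = false;

	atomic_init(&lock.locked, 0);
	done = 0;
	payload = 0;
	pthread_create(&t, NULL, waker, NULL);
	while (!seen) {
		toy_spin_lock(&lock);
		seen = done != 0;
		toy_spin_unlock(&lock);
	}
	/* The critical section that saw done == 1 is ordered after the
	 * waker's unlock, so the waker's write to payload is visible. */
	assert(payload == 42);
	pthread_join(t, NULL);
	return payload;
}
```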

But if you are saying that it would be good to have wait_for_completion()
and complete() directly modeled at some point, no argument. In addition,
I hope that the memory model is applied to other tools that analyze kernel
code.

> > > > So I dropped that patch yesterday. The main thing I was missing was
> > > > that there is no ordering-free fastpath in wait_for_completion() and
> > > > complete(): Each unconditionally acquires the lock. So the smp_mb()
> > > > that I was trying to add doesn't need to be there.
> > >
> > > Going by the above, it never needs to be there, even if there was a
> > > lock-free fast-path.
> >
> > Given that wait_for_completion()/complete() both acquire the same lock,
> > yes. And agreed: if it were lockless but still provided the release
> > and acquire ordering, then yes as well.
> I'm not sure I got the point across, so I'll try once more. Without
> providing this ordering the completion would be fundamentally broken. It
> _must_ provide this ordering.

OK, I now understand what you are getting at, and I do very much like
that guarantee.
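Along the lines of Peter's point that even a lock-free fast path would need no extra smp_mb(), here is a hypothetical completion with a lockless fast path and a mutex-protected slow path. It is a one-shot flag rather than the kernel's counting completion, and fp_completion, fp_complete(), fp_wait_for_completion() and fastpath_demo() are invented for illustration. The fast path relies only on an acquire load pairing with the release store in fp_complete(); the slow path gets its ordering from the mutex:

```c
/* Userspace sketch; the fp_* names are hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

struct fp_completion {
	pthread_mutex_t lock;
	pthread_cond_t wait;
	atomic_int done;
};

static void fp_complete(struct fp_completion *c)
{
	pthread_mutex_lock(&c->lock);
	/* Release store so that the lockless fast path can pair with it. */
	atomic_store_explicit(&c->done, 1, memory_order_release);
	pthread_cond_broadcast(&c->wait);
	pthread_mutex_unlock(&c->lock);
}

static void fp_wait_for_completion(struct fp_completion *c)
{
	/* Fast path: no lock and no full barrier, just an acquire load. */
	if (atomic_load_explicit(&c->done, memory_order_acquire))
		return;

	/* Slow path: done is only written under the mutex, so seeing it set
	 * here means the waker's unlock is ordered before our lock. */
	pthread_mutex_lock(&c->lock);
	while (!atomic_load_explicit(&c->done, memory_order_relaxed))
		pthread_cond_wait(&c->wait, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

static int payload;

static void *waker(void *arg)
{
	struct fp_completion *c = arg;

	payload = 42;
	fp_complete(c);
	return NULL;
}

int fastpath_demo(void)
{
	struct fp_completion c;
	pthread_t t;

	pthread_mutex_init(&c.lock, NULL);
	pthread_cond_init(&c.wait, NULL);
	atomic_init(&c.done, 0);
	payload = 0;

	pthread_create(&t, NULL, waker, &c);
	fp_wait_for_completion(&c);	/* fast or slow path, depending on timing */
	assert(payload == 42);
	fp_wait_for_completion(&c);	/* done already set: always the fast path */
	pthread_join(t, NULL);

	pthread_cond_destroy(&c.wait);
	pthread_mutex_destroy(&c.lock);
	return payload;
}
```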

> > But if it was instead structured like
> > wait_event()/wake_up(), there would be ordering only if the caller
> > supplied it.
> Right, wait_event()/wake_up() are different in that the 'condition'
> variable is external to the abstraction and thus it cannot help.
> All wait_event()/wake_up() can guarantee is that IFF it does a wakeup,
> the woken thread will observe the prior state of the waker. But given
> the actual condition is external and we might not hit the actual sleep
> case, there are no guarantees.
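The wait_event()/wake_up() distinction can be sketched the same way. In this hypothetical toy wait queue (toy_wq, toy_wait_event(), toy_wake_up() and wait_event_demo() are not kernel APIs), the condition is checked without the lock and, if already true, the waiter never sleeps, so the wait queue itself orders nothing on that path. The caller supplies the ordering with a release store to the condition variable and an acquire load inside the condition:

```c
/* Userspace sketch; the toy_* names are hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

struct toy_wq {
	pthread_mutex_t lock;
	pthread_cond_t cond;
};

/* The condition is external to the wait queue. If it is already true,
 * we return without locking or sleeping, so any ordering on that path
 * must come from the condition expression itself. */
#define toy_wait_event(wq, condition)					\
	do {								\
		if (condition)						\
			break;						\
		pthread_mutex_lock(&(wq)->lock);			\
		while (!(condition))					\
			pthread_cond_wait(&(wq)->cond, &(wq)->lock);	\
		pthread_mutex_unlock(&(wq)->lock);			\
	} while (0)

static void toy_wake_up(struct toy_wq *wq)
{
	pthread_mutex_lock(&wq->lock);
	pthread_cond_broadcast(&wq->cond);
	pthread_mutex_unlock(&wq->lock);
}

static struct toy_wq wq = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };
static atomic_int flag;
static int payload;

static void *waker(void *arg)
{
	(void)arg;
	payload = 42;
	/* Caller-supplied ordering: release store of the condition. */
	atomic_store_explicit(&flag, 1, memory_order_release);
	toy_wake_up(&wq);
	return NULL;
}

int wait_event_demo(void)
{
	pthread_t t;

	atomic_store_explicit(&flag, 0, memory_order_relaxed);
	payload = 0;
	pthread_create(&t, NULL, waker, NULL);
	/* Caller-supplied ordering: acquire load inside the condition. */
	toy_wait_event(&wq, atomic_load_explicit(&flag, memory_order_acquire));
	assert(payload == 42);
	pthread_join(t, NULL);
	return payload;
}
```

Replacing memory_order_acquire in the condition with memory_order_relaxed would mean that a waiter which finds the condition already true (and so never sleeps) has no ordering against the waker's write to payload.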


> > All that aside, paring the ordering down to the bare minimum is not
> > always the right approach.
> Why not? In what sort of cases does it go wobbly?

For one, when it conflicts with maintainability. For example, it would
probably be OK for some of RCU's rcu_node ->lock acquisitions to skip the
smp_mb__after_unlock_lock() invocations. But those are slowpaths, and the
small speedup on only one architecture is just not worth the added pain.
Especially given the nice wrapper functions that you provided.

But of course if this were instead (say) rcu_read_lock() or common-case
rcu_read_unlock(), I would be willing to undergo much more pain. On the
other hand, for that exact reason, that common-case code path doesn't
acquire locks in the first place. ;-)

Thanx, Paul