Re: rcu_read_lock lost its compiler barrier

From: Paul E. McKenney
Date: Mon Jun 03 2019 - 05:31:40 EST


On Mon, Jun 03, 2019 at 11:03:24AM +0800, Herbert Xu wrote:
> On Sun, Jun 02, 2019 at 05:06:17PM -0700, Paul E. McKenney wrote:
> >
> > Please note that preemptible Tree RCU has lacked the compiler barrier on
> > all but the outermost rcu_read_unlock() for years before Boqun's patch.
>
> Actually this is not true. Boqun's patch (commit bb73c52bad36) does
> not add a barrier() to __rcu_read_lock. In fact I dug into the git
> history and this compiler barrier() has existed in preemptible tree
> RCU since the very start in 2009:

I said rcu_read_unlock() and you said __rcu_read_lock().

> : commit f41d911f8c49a5d65c86504c19e8204bb605c4fd
> : Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> : Date: Sat Aug 22 13:56:52 2009 -0700
> :
> : rcu: Merge preemptable-RCU functionality into hierarchical RCU
> :
> : +/*
> : + * Tree-preemptable RCU implementation for rcu_read_lock().
> : + * Just increment ->rcu_read_lock_nesting, shared state will be updated
> : + * if we block.
> : + */
> : +void __rcu_read_lock(void)
> : +{
> : +	ACCESS_ONCE(current->rcu_read_lock_nesting)++;
> : +	barrier(); /* needed if we ever invoke rcu_read_lock in rcutree.c */
> : +}
> : +EXPORT_SYMBOL_GPL(__rcu_read_lock);

Thank you for finding this! This particular version does have an
unconditional barrier() in __rcu_read_unlock(), for whatever that
is worth:

+void __rcu_read_unlock(void)
+{
+	struct task_struct *t = current;
+
+	barrier(); /* needed if we ever invoke rcu_read_unlock in rcutree.c */
+	if (--ACCESS_ONCE(t->rcu_read_lock_nesting) == 0 &&
+	    unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
+		rcu_read_unlock_special(t);
+}

I would not have seen the point of a compiler barrier in the non-outermost
__rcu_read_unlock(), since the completion of an inner __rcu_read_unlock()
does not permit the grace period to complete.

> However, you are correct that in the non-preempt tree RCU case,
> the compiler barrier in __rcu_read_lock was not always present.
> In fact it was added by:
>
> : commit 386afc91144b36b42117b0092893f15bc8798a80
> : Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> : Date: Tue Apr 9 10:48:33 2013 -0700
> :
> : spinlocks and preemption points need to be at least compiler barriers
>
> I suspect this is what prompted you to remove it in 2015.

If I remember correctly, it was pointed out to me that in !PREEMPT kernels,
the compiler barrier in the preempt_disable() invoked in rcu_read_lock()
(and similar on the rcu_read_unlock() side) wasn't helping anything.

> > I do not believe that reverting that patch will help you at all.
> >
> > But who knows? So please point me at the full code body that was being
> > debated earlier on this thread. It will no doubt take me quite a while to
> > dig through it, given my being on the road for the next couple of weeks,
> > but so it goes.
>
> Please refer to my response to Linus for the code in question.
>
> In any case, I am now even more certain that compiler barriers are
> not needed in the code in question. The reasoning is quite simple.
> If you need those compiler barriers then you surely need real memory
> barriers.

OK, we are in agreement on that point, then!

> Vice versa, if real memory barriers are already present thanks to
> RCU, then you don't need those compiler barriers.

For users of RCU, this seems reasonable.

On the other hand, the compiler barriers in PREEMPT Tree RCU's outermost
__rcu_read_lock() and __rcu_read_unlock() invocations really are needed
by RCU internals. This is because RCU makes use of interrupt handlers that
access per-task and per-CPU variables, and these need to be able to
sense the edges of the nested set of RCU read-side critical sections.
It is OK for these interrupt handlers to think that the critical section
is larger than it really is, but fatal for them to think that the critical
sections are smaller than they really are.
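
For illustration only, here is a user-space sketch of that constraint. It
is not the kernel's implementation: nesting, protected_data, and the
sketch_*() and handler_sees_reader() names are all hypothetical, the
"handler" is simply called synchronously, and barrier() is written out in
GCC/Clang syntax. The point is the placement of the compiler barriers
relative to the counter updates, so that plain accesses made inside the
critical section cannot be moved outside the window the counter describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Compiler-only barrier in GCC/Clang syntax (assumed toolchain). */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical stand-in for a per-task nesting counter. */
static volatile int nesting;

/* Hypothetical data accessed inside the read-side critical section. */
static int protected_data;

static void sketch_read_lock(void)
{
	nesting = nesting + 1;
	barrier(); /* later plain accesses may not move above the increment */
}

static void sketch_read_unlock(void)
{
	barrier(); /* earlier plain accesses may not move below the decrement */
	nesting = nesting - 1;
}

/* The question a handler interrupting this task would ask. */
static bool handler_sees_reader(void)
{
	return nesting > 0;
}

/* Counts how many of three probes observe a reader in progress. */
static int nested_reader(void)
{
	int seen = 0;

	sketch_read_lock();             /* outermost: 0 -> 1 */
	sketch_read_lock();             /* nested:    1 -> 2 */
	protected_data = 42;            /* plain access inside the section */
	seen += handler_sees_reader();  /* inside the nested section */
	sketch_read_unlock();           /* inner:     2 -> 1 */
	seen += handler_sees_reader();  /* inner unlock did not end the section */
	sketch_read_unlock();           /* outermost: 1 -> 0 */
	seen += handler_sees_reader();  /* section is now over */
	return seen;
}
```

In this sketch the probe after the inner unlock still reports a reader,
which is the conservative ("larger than it really is") direction. Only
the outermost unlock changes the answer.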

> In fact this calls into question the use of READ_ONCE/WRITE_ONCE in
> RCU primitives such as rcu_dereference and rcu_assign_pointer.

No, these are -not- called into question, or if they are, the question
gets quickly answered in a way that supports current Linux-kernel code.
As mentioned in earlier emails, the traditional uses of RCU that involve
rcu_dereference(), rcu_assign_pointer(), and synchronize_rcu() all work
just fine.

In fact, from what I can see, the issue stems from having developed
intuitions from working with the traditional rcu_dereference(),
rcu_assign_pointer(), and synchronize_rcu() linked-structure use cases,
and then attempting to apply these intuitions to use cases that have
neither rcu_dereference() nor rcu_assign_pointer(). Don't get me wrong,
it is only natural to try to extend your intuitions to something that
admittedly looks pretty similar to the traditional use cases. But this
is one of those cases where "pretty similar" might not be good enough.

> IIRC
> when RCU was first added to the Linux kernel we did not have compiler
> barriers in rcu_dereference and rcu_assign_pointer. They were added
> later on.

From what I can see, rcu_dereference() still does not have a compiler
barrier. Please note that the pair of barrier() calls in READ_ONCE()
applies only when READ_ONCE()ing something larger than the machine can load.
And if your platform cannot load and store pointers with a single access,
the Linux kernel isn't going to do very well regardless. Ditto for
WRITE_ONCE().
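
As a rough illustration of the machine-word case, the following
user-space sketch shows a once-only load and store built from nothing but
a volatile cast, with no barrier() involved. The SKETCH_* macros, struct
node, and shared_ptr are hypothetical names, not the kernel's
definitions, and __typeof__ is a GCC/Clang extension.

```c
#include <assert.h>

/*
 * Hypothetical simplified forms, valid only for scalars that the
 * machine can load or store with a single access.
 */
#define SKETCH_READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define SKETCH_WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct node {
	int value;
};

/* Hypothetical shared pointer, pointer-sized and so word-sized. */
static struct node *shared_ptr;

static struct node *load_once(void)
{
	/* The compiler must emit exactly one load of shared_ptr here. */
	return SKETCH_READ_ONCE(shared_ptr);
}

static void store_once(struct node *p)
{
	/* The compiler must emit exactly one store to shared_ptr here. */
	SKETCH_WRITE_ONCE(shared_ptr, p);
}
```

The volatile access constrains only how the compiler treats this one
load or store. It says nothing about the ordering of surrounding plain
accesses, which is why it does not amount to a compiler barrier.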

> As compiler barriers per se are useless, these are surely meant to
> be coupled with the memory barriers provided by RCU grace periods
> and synchronize_rcu. But then those real memory barriers would have
> compiler barriers too. So why do we need the compiler barriers in
> rcu_dereference and rcu_assign_pointer?

In rcu_dereference(), RCU does not need them. They are instead
inherited from READ_ONCE() for when it is used on a data structure too
big for any single load instruction available on the system in question.
These barrier() calls are in a case that rcu_dereference() had better
not be using -- after all, using them would mean that the hardware didn't
have a load instruction big enough to handle a pointer.

In rcu_assign_pointer(), RCU just needs this to act like a release
store, that is, the store itself must not be reordered with any earlier
memory accesses. The Linux kernel's smp_store_release() currently
over-approximates this using a barrier() or equivalent inline-assembly
directive, which enforces compiler ordering for not only the release
store, but also for all memory accesses following the release store.
Obviously, barrier() is not enough for weakly ordered systems, which
must also emit an appropriate memory-barrier instruction (or a special
store instruction for architectures like ARMv8 providing such a thing).
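
The release-store requirement can be sketched in portable C11 as below.
This is an analogy, not the kernel's code: struct config, slot,
published, publish(), and reader_sum() are hypothetical names, and the
reader uses an acquire load as a stronger stand-in for the dependency
ordering that rcu_dereference() actually relies on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct config {
	int a;
	int b;
};

/* Hypothetical storage and the pointer through which it is published. */
static struct config slot;
static _Atomic(struct config *) published;

static void publish(int a, int b)
{
	/* Initialization: plain stores that must be visible first. */
	slot.a = a;
	slot.b = b;

	/*
	 * Release store: neither the compiler nor the CPU may reorder
	 * the stores above to after this one.
	 */
	atomic_store_explicit(&published, &slot, memory_order_release);
}

static int reader_sum(void)
{
	/* Acquire load pairs with the release store in publish(). */
	struct config *c =
		atomic_load_explicit(&published, memory_order_acquire);

	return c ? c->a + c->b : -1; /* -1 means nothing published yet */
}
```

A reader that observes the non-NULL pointer is thereby guaranteed to
observe the initialized fields, which is all that the publication side
needs from the store.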

The compiler barriers in __rcu_read_lock() and __rcu_read_unlock() are
there so that preemptible Tree RCU can use its various tricks to make
readers perform and scale well. Read-side state is confined to the CPU
and/or task in the common case, thus avoiding heavy synchronization
overhead in the common case (or, in the case of !PREEMPT RCU, thus
avoiding -any- synchronization overhead in the common case). For example,
the compiler barriers ensure that RCU's scheduler-clock code and softirq
code can trust per-CPU/task state indicating whether or not there is an
RCU read-side critical section in effect.

Does that help? Or am I missing your point?

Thanx, Paul