Re: rcu_read_lock lost its compiler barrier

From: Paul E. McKenney
Date: Tue Jun 04 2019 - 17:18:57 EST

On Mon, Jun 03, 2019 at 01:24:32PM -0700, Linus Torvalds wrote:
> On Mon, Jun 3, 2019 at 12:53 PM Paul E. McKenney <paulmck@xxxxxxxxxxxxx> wrote:
> >
> > I agree that !PREEMPT rcu_read_lock() would not affect compiler code
> > generation, but given that get_user() is a volatile asm, isn't the
> > compiler already forbidden from reordering it with the volatile-casted
> > WRITE_ONCE() access, even if there was nothing at all between them?
> > Or are asms an exception to the rule that volatile executions cannot
> > be reordered?
> Paul, you MAKE NO SENSE.
> What is wrong with you?

Mostly that I didn't check all architectures' definitions of get_user().
Had I done so, I would have seen that not all of the corresponding asms
have the "volatile" keyword. And of course, without that keyword, there
is absolutely nothing preventing the compiler from reordering the asm
with pretty much anything. The only things that would be absolutely
guaranteed to prevent reordering would be things like memory clobbers
(barrier()) or accesses that overlap the asm's input/output list.
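
A minimal userspace sketch of that distinction, assuming GCC-style
extended asm (GCC or Clang). The barrier() macro has the same shape as
the kernel's; copy_plain(), copy_volatile(), store_twice() and
shared_flag are hypothetical names used only for illustration:

```c
/* Same shape as the kernel's barrier(): an empty asm with a "memory" clobber. */
#define barrier() __asm__ __volatile__("" : : : "memory")

int shared_flag;	/* hypothetical shared variable */

/*
 * Plain asm: the compiler may treat it as a pure function of its
 * inputs, so it can move it, merge duplicates, or delete it outright
 * if "out" turns out to be unused.
 */
static inline int copy_plain(int in)
{
	int out;

	__asm__("" : "=r"(out) : "0"(in));
	return out;
}

/*
 * Volatile asm: not deleted or merged, but "volatile" by itself is not
 * documented as a general ordering guarantee against other code.
 */
static inline int copy_volatile(int in)
{
	int out;

	__asm__ __volatile__("" : "=r"(out) : "0"(in));
	return out;
}

/* The memory clobber keeps each store on its own side of barrier(). */
int store_twice(void)
{
	shared_flag = 1;
	barrier();
	shared_flag = 2;
	return shared_flag;
}
```

Both copy helpers return their argument unchanged; the only difference
is how much freedom the optimizer has with the asm statement itself.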

Yeah, I know, even with the "volatile" keyword, it is not entirely clear
how much reordering the compiler is allowed to do. I was relying on
the GCC documentation, which says:

	The typical use of extended asm statements is to
	manipulate input values to produce output values. However,
	your asm statements may also produce side effects. If so,
	you may need to use the volatile qualifier to disable
	certain optimizations. See Volatile.

But the linked-to "Volatile" section later in that same web page mostly
talks about the compiler's ability to hoist asms out of loops.

> I just showed you an example of where rcu_read_lock() needs to be a
> compiler barrier, and then you make incoherent noises about
> WRITE_ONCE() that do not even exist in that example.

I thought we were discussing this example, but it doesn't matter because
I was missing your point about get_user() and page faults:

	get_user(val, ptr);
	WRITE_ONCE(state, 1);

But regardless, given that some architectures omit volatile from their
asms implementing get_user(), even an optimistic interpretation of that
part of the GCC documentation would still permit reordering the above.
And again, I was missing your point about get_user() causing page faults
and thus context switches.

> Forget about your READ_ONCE/WRITE_ONCE theories. Herbert already
> showed code that doesn't have those accessors, so reality doesn't
> match your fevered imagination.

I get the feeling that you believe that I want LKMM to be some sort of
final judge and absolute arbiter of what code is legal and not from
a memory-ordering perspective. This is absolutely -not- the case.
The current state of the art, despite the recent impressive progress,
simply cannot reasonably do this. So all I can claim is that LKMM
dispenses advice, hopefully good advice. (It is early days for LKMM's
handling of plain accesses, so some work might be required to deliver
on the "good advice" promise, but we have to start somewhere. Plus it
is progressing nicely.)

The places where long-standing RCU patterns require rcu_dereference()
and rcu_assign_pointer() do require some attention to avoid compiler
optimizations, and {READ,WRITE}_ONCE() is one way of addressing this.
But not the only way, nor always the best way. For example, some
fastpaths might need the optimizations that {READ,WRITE}_ONCE()
suppresses. Therefore, Linux kernel hackers have a number of other
ways of paying attention. For example, accesses might be constrained
via barrier() and friends. For another example, some developers might
check assembly output (hopefully scripted somehow).
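
To illustrate two of those options side by side, here is a userspace
sketch. The READ_ONCE()/WRITE_ONCE() macros below are simplified
volatile-cast stand-ins rather than the kernel's actual definitions, and
ready, payload, producer() and consumer() are hypothetical names. Note
that this constrains only the compiler, not the CPU:

```c
#define barrier() __asm__ __volatile__("" : : : "memory")

/* Simplified stand-ins: force exactly one access through a volatile cast. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

int ready;	/* hypothetical flag */
int payload;	/* hypothetical data guarded by the flag */

void producer(int v)
{
	payload = v;		/* plain access, still open to optimization */
	barrier();		/* compiler may not sink the store below here */
	WRITE_ONCE(ready, 1);	/* marked access: not torn, fused, or elided */
}

int consumer(void)
{
	if (!READ_ONCE(ready))	/* marked access: really reloaded each call */
		return -1;
	barrier();		/* compiler may not hoist the load above here */
	return payload;		/* plain access constrained only by barrier() */
}
```

Here the flag uses the marked accessors while the payload stays a plain
access fenced by barrier(), which is one way a fastpath might keep some
optimizations available.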

Again, the Linux-kernel memory model dispenses advice, not absolutes.
Furthermore, the way it dispenses advice is currently a bit limited.
It can currently say that it is nervous about lack of {READ,WRITE}_ONCE(),
as in "Flag data-race", but it would be difficult to make it recommend
the other options in an intelligent way. So we should interpret "Flag
data-race" as LKMM saying "I am nervous about your unmarked accesses"
rather than "You absolutely must call {READ,WRITE}_ONCE() more often!!!"
Again, advice, not absolutes.

So the idea is that you add and remove {READ,WRITE}_ONCE() to/from the
-litmus- -tests- to determine which accesses LKMM is nervous about.
But that doesn't necessarily mean that {READ,WRITE}_ONCE() goes into
the corresponding places in the Linux kernel.

Does that help, or am I still confused?

> And sometimes it's not even possible, since you can't do a bitfield
> access, for example, with READ_ONCE().

Ah, good point. So the Linux kernel uses bitfields to communicate
between mainline and interrupt handlers. New one on me. :-/
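
A small sketch of why bitfields are awkward here, assuming an accessor
that works by taking the address of its argument (as a volatile-cast
READ_ONCE() would). The struct layout and function names are
hypothetical:

```c
#define barrier() __asm__ __volatile__("" : : : "memory")

struct irq_flags {		/* hypothetical layout */
	unsigned int pending : 1;
	unsigned int masked : 1;
};

struct irq_flags demo_flags;

/*
 * An address-taking accessor cannot be applied to demo_flags.pending:
 * C forbids taking the address of a bit-field. So the access is left
 * plain and fenced with barrier() instead.
 */
unsigned int read_pending(void)
{
	unsigned int v;

	barrier();	/* do not reuse a value loaded before this point */
	v = demo_flags.pending;
	barrier();	/* do not defer the load past this point */
	return v;
}

void set_pending(void)
{
	barrier();
	demo_flags.pending = 1;	/* read-modify-write of the containing word */
	barrier();
}
```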

> > We can of course put them back in,
> Stop the craziness. It's not "we can". It is a "we will".
> So I will add that barrier, and you need to stop arguing against it
> based on specious theoretical arguments that do not match reality. And
> we will not ever remove that barrier again. Herbert already pointed to
> me having to do this once before in commit 386afc91144b ("spinlocks
> and preemption points need to be at least compiler barriers"), and
> rcu_read_lock() clearly has at a minimum that same preemption point
> issue.

And the lack of "volatile" allows get_user() to migrate page faults
(and thus context switches) into RCU read-side critical sections
in CONFIG_PREEMPT=n kernels. Yes, this would be very bad.
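
A userspace sketch of the shape of the problem. Everything here is a
hypothetical stand-in: sketch_get_user() is just a non-volatile asm
wrapped around a plain load, and the sketch_rcu_*() helpers model a
CONFIG_PREEMPT=n rcu_read_lock()/rcu_read_unlock() that consists of
nothing but barrier():

```c
#define barrier() __asm__ __volatile__("" : : : "memory")

int state;	/* hypothetical variable written inside the critical section */

/* Stand-in for a get_user() whose asm lacks "volatile". */
static inline int sketch_get_user(const int *uptr)
{
	int v;

	/* In a real kernel this access could page-fault and sleep. */
	__asm__("" : "=r"(v) : "0"(*uptr));
	return v;
}

/* With barrier() restored, these are compiler barriers and nothing more. */
static inline void sketch_rcu_read_lock(void)   { barrier(); }
static inline void sketch_rcu_read_unlock(void) { barrier(); }

int read_then_publish(const int *uptr)
{
	int val = sketch_get_user(uptr);	/* meant to run outside the reader */

	/*
	 * If sketch_rcu_read_lock() generated no code at all, nothing
	 * would stop the compiler from sinking the load of *uptr below
	 * this point, i.e. into the critical section.
	 */
	sketch_rcu_read_lock();
	state = val;
	sketch_rcu_read_unlock();
	return val;
}
```

In this sketch the memory clobber keeps the load of *uptr above the
lock; the program's result is the same either way, so the difference
shows up only in where a fault could occur.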

OK, I finally got it, so please accept my apologies for my earlier
confusion.

I don't yet see a commit from you, so I queued the one below locally
and started testing.

Thanx, Paul


commit 9b4766c5523efb8d3d52b2ba2a29fd69cdfc65bb
Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>
Date: Tue Jun 4 14:05:52 2019 -0700

rcu: Restore barrier() to rcu_read_lock() and rcu_read_unlock()

Commit bb73c52bad36 ("rcu: Don't disable preemption for Tiny and Tree
RCU readers") removed the barrier() calls from rcu_read_lock() and
rcu_read_unlock() in CONFIG_PREEMPT=n&&CONFIG_PREEMPT_COUNT=n kernels.
Within RCU, this commit was OK, but it failed to account for things like
get_user() that can pagefault and that can be reordered by the compiler.
Lack of the barrier() calls in rcu_read_lock() and rcu_read_unlock()
can cause these page faults to migrate into RCU read-side critical
sections, which in CONFIG_PREEMPT=n kernels could result in too-short
grace periods and arbitrary misbehavior. Please see commit 386afc91144b
("spinlocks and preemption points need to be at least compiler barriers")
for more details.

This commit therefore restores the barrier() call to both rcu_read_lock()
and rcu_read_unlock(). It also removes the compensatory barrier() calls
that the RCU update machinery needed in their absence, effectively
reverting commit bb73c52bad36 ("rcu: Don't disable preemption for Tiny
and Tree RCU readers").

Reported-by: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
Reported-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0c9b92799abc..8f7167478c1d 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -56,14 +56,12 @@ void __rcu_read_unlock(void);

static inline void __rcu_read_lock(void)
- preempt_disable();
+ preempt_disable();

static inline void __rcu_read_unlock(void)
- preempt_enable();
+ preempt_enable();

static inline int rcu_preempt_depth(void)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 3f52d8438e0f..841060fce33c 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -288,7 +288,6 @@ void rcu_note_context_switch(bool preempt)
struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
struct rcu_node *rnp;

- barrier(); /* Avoid RCU read-side critical sections leaking down. */
trace_rcu_utilization(TPS("Start context switch"));
WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0);
@@ -340,7 +339,6 @@ void rcu_note_context_switch(bool preempt)
if (rdp->exp_deferred_qs)
trace_rcu_utilization(TPS("End context switch"));
- barrier(); /* Avoid RCU read-side critical sections leaking up. */

@@ -828,11 +826,6 @@ static void rcu_qs(void)
* dyntick-idle quiescent state visible to other CPUs, which will in
* some cases serve for expedited as well as normal grace periods.
* Either way, register a lightweight quiescent state.
- *
- * The barrier() calls are redundant in the common case when this is
- * called externally, but just in case this is called from within this
- * file.
- *
void rcu_all_qs(void)
@@ -847,14 +840,12 @@ void rcu_all_qs(void)
this_cpu_write(rcu_data.rcu_urgent_qs, false);
- barrier(); /* Avoid RCU read-side critical sections leaking down. */
if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
- barrier(); /* Avoid RCU read-side critical sections leaking up. */
@@ -864,7 +855,6 @@ EXPORT_SYMBOL_GPL(rcu_all_qs);
void rcu_note_context_switch(bool preempt)
- barrier(); /* Avoid RCU read-side critical sections leaking down. */
trace_rcu_utilization(TPS("Start context switch"));
/* Load rcu_urgent_qs before other flags. */
@@ -877,7 +867,6 @@ void rcu_note_context_switch(bool preempt)
trace_rcu_utilization(TPS("End context switch"));
- barrier(); /* Avoid RCU read-side critical sections leaking up. */