Re: rcu_read_lock lost its compiler barrier
From: Paul E. McKenney
Date: Sun Jun 02 2019 - 23:52:14 EST
On Mon, Jun 03, 2019 at 10:46:40AM +0800, Herbert Xu wrote:
> On Sun, Jun 02, 2019 at 01:54:12PM -0700, Linus Torvalds wrote:
> > On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > You can't then go and decide to remove the compiler barrier! To do
> > > that you'd need to audit every single use of rcu_read_lock in the
> > > kernel to ensure that they're not depending on the compiler barrier.
> >
> > What's the possible case where it would matter when there is no preemption?
>
> The case we were discussing is from net/ipv4/inet_fragment.c from
> the net-next tree:
>
> void fqdir_exit(struct fqdir *fqdir)
> {
>         ...
>         fqdir->dead = true;
>
>         /* call_rcu is supposed to provide memory barrier semantics,
>          * separating the setting of fqdir->dead with the destruction
>          * work. This implicit barrier is paired with inet_frag_kill().
>          */
>
>         INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn);
>         queue_rcu_work(system_wq, &fqdir->destroy_rwork);
> }
>
> and
>
> void inet_frag_kill(struct inet_frag_queue *fq)
> {
>         ...
>         rcu_read_lock();
>         /* The RCU read lock provides a memory barrier
>          * guaranteeing that if fqdir->dead is false then
>          * the hash table destruction will not start until
>          * after we unlock. Paired with inet_frags_exit_net().
>          */
>         if (!fqdir->dead) {
>                 rhashtable_remove_fast(&fqdir->rhashtable, &fq->node,
>                                        fqdir->f->rhash_params);
>                 ...
>         }
>         ...
>         rcu_read_unlock();
>         ...
> }
>
> I simplified this to
>
> Initial values:
>
> a = 0
> b = 0
>
> CPU1                            CPU2
> ----                            ----
> a = 1                           rcu_read_lock
> synchronize_rcu                 if (a == 0)
> b = 2                                   b = 1
>                                 rcu_read_unlock
>
> On exit we want this to be true:
> b == 2
>
> Now what Paul was telling me is that unless every memory operation
> is done with READ_ONCE/WRITE_ONCE then his memory model shows that
> the exit constraint won't hold.
But please note that the plain-variable portion of the memory model is
very new and likely still has a bug or two. In fact, see below.
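For those who would like to poke at this themselves, Herbert's simplified
example might be encoded as an LKMM litmus test along the following lines
(the test name and the exact encoding are mine, so caveat emptor):

        C rcu-plain-b

        {}

        P0(int *a, int *b)
        {
                *a = 1;
                synchronize_rcu();
                *b = 2;
        }

        P1(int *a, int *b)
        {
                rcu_read_lock();
                if (READ_ONCE(*a) == 0)
                        *b = 1;
                rcu_read_unlock();
        }

        exists (b=1)

If I have the encoding right, running this through herd7 with
tools/memory-model's linux-kernel.cfg flags a data race on "b", which
is the complaint discussed below.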
> IOW, we need
>
> CPU1                            CPU2
> ----                            ----
> WRITE_ONCE(a, 1)                rcu_read_lock
> synchronize_rcu                 if (READ_ONCE(a) == 0)
> WRITE_ONCE(b, 2)                        WRITE_ONCE(b, 1)
>                                 rcu_read_unlock
>
> Now I think this is bullshit because if we really needed these compiler
> barriers then we surely would need real memory barriers to go with
> them.
On the one hand, you have no code before your rcu_read_lock() and also
no code after your rcu_read_unlock(). So in this particular example,
adding compiler barriers to these guys won't help you.
On the other hand, on CPU 1's write to "b", I agree with you and disagree
with the model, though perhaps my partners in LKMM crime will show me the
error of my ways on this point. On CPU 2's write to "b", I can see the
memory model's point, but getting there requires some gymnastics on the
part of both the compiler and the CPU. The WRITE_ONCE() and READ_ONCE()
for "a" is the normal requirement for variables that are concurrently
loaded and stored.
Please note that garden-variety uses of RCU have similar requirements,
namely the rcu_assign_pointer() on the one side and the rcu_dereference()
on the other. Your use case allows rcu_assign_pointer() to be weakened
to WRITE_ONCE() and rcu_dereference() to be weakened to READ_ONCE()
(not that this last is all that much of a weakening these days).
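Concretely, in the fqdir code quoted above, that weakening would look
something like this (a sketch of just the two accesses in question):

        /* fqdir_exit() */
        WRITE_ONCE(fqdir->dead, true);

        /* inet_frag_kill() */
        rcu_read_lock();
        if (!READ_ONCE(fqdir->dead)) {
                rhashtable_remove_fast(&fqdir->rhashtable, &fq->node,
                                       fqdir->f->rhash_params);
        }
        rcu_read_unlock();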
> In fact, the sole purpose of the RCU mechanism is to provide those
> memory barriers. Quoting from
> Documentation/RCU/Design/Requirements/Requirements.html:
>
> <li> Each CPU that has an RCU read-side critical section that
> begins before <tt>synchronize_rcu()</tt> starts is
> guaranteed to execute a full memory barrier between the time
> that the RCU read-side critical section ends and the time that
> <tt>synchronize_rcu()</tt> returns.
> Without this guarantee, a pre-existing RCU read-side critical section
> might hold a reference to the newly removed <tt>struct foo</tt>
> after the <tt>kfree()</tt> on line 14 of
> <tt>remove_gp_synchronous()</tt>.
> <li> Each CPU that has an RCU read-side critical section that ends
> after <tt>synchronize_rcu()</tt> returns is guaranteed
> to execute a full memory barrier between the time that
> <tt>synchronize_rcu()</tt> begins and the time that the RCU
> read-side critical section begins.
> Without this guarantee, a later RCU read-side critical section
> running after the <tt>kfree()</tt> on line 14 of
> <tt>remove_gp_synchronous()</tt> might
> later run <tt>do_something_gp()</tt> and find the
> newly deleted <tt>struct foo</tt>.
Ahem.
1. These guarantees are of full memory barriers, -not- compiler
barriers.
2. These rules don't say exactly where these full memory barriers
go. SRCU is at one extreme, placing those full barriers in
srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU
at the other, placing these barriers entirely within the callback
queueing/invocation, grace-period computation, and the scheduler.
Preemptible Tree RCU is in the middle, with rcu_read_unlock()
sometimes including a full memory barrier, but other times with
the full memory barrier being confined as it is with !PREEMPT
Tree RCU.
So let's place those memory barriers in your example, interleaving:
        CPU2: rcu_read_lock();
        CPU1: WRITE_ONCE(a, 1)  |  CPU2: if (READ_ONCE(a) == 0)
        CPU2:         WRITE_ONCE(b, 1);
        CPU2: rcu_read_unlock();
        /* Could put a full memory barrier here, but it wouldn't help. */
        CPU1: synchronize_rcu();
        CPU1: b = 2;
Here the accesses to "a" are concurrent, and there are legal placements
for the required full memory barrier that don't change that. In fact,
I cannot see how any memory-barrier placement can help here. So the
WRITE_ONCE() and READ_ONCE() should be used for "a". And again, in
garden-variety RCU use cases, these would be rcu_assign_pointer() and
rcu_dereference(), respectively. So again, I advise using READ_ONCE()
and WRITE_ONCE() for the accesses to "a", for documentation purposes,
even ignoring the future proofing.
Now let's do the (admittedly quite crazy, at least here in 2019)
profiling-directed transformation of the "b = 1" on a weakly ordered
system:
        CPU1: WRITE_ONCE(a, 1);
        CPU1: synchronize_rcu();
        CPU1: b = 2;

        CPU2: rcu_read_lock();
        CPU2: if (READ_ONCE(a) == 0)
        CPU2:         if (b != 1)
        CPU2:                 b = 1;
        CPU2: rcu_read_unlock();
Interleaving and inserting full memory barriers as per the rules above:
        CPU1: WRITE_ONCE(a, 1);
        CPU1: synchronize_rcu();
        /* Could put a full memory barrier here, but it wouldn't help. */
        CPU2: rcu_read_lock();
        CPU1: b = 2;
        CPU2: if (READ_ONCE(a) == 0)
        CPU2:         if (b != 1)  /* Weakly ordered CPU moved this up! */
        CPU2:                 b = 1;
        CPU2: rcu_read_unlock();
In fact, CPU2's load from b might be moved up to race with CPU1's store,
which (I believe) is why the model complains in this case.
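Note that if CPU2's store had been WRITE_ONCE(b, 1), the compiler would
not have been within its rights to invent that load in the first place,
volatile stores being immune to this sort of transformation:

        CPU2: rcu_read_lock();
        CPU2: if (READ_ONCE(a) == 0)
        CPU2:         WRITE_ONCE(b, 1);  /* No invented load, no tearing. */
        CPU2: rcu_read_unlock();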
I still cannot imagine why CPU1's "b = 2" needs to be WRITE_ONCE(),
perhaps due to a failure of imagination on my part.
> My review of the RCU code shows that these memory barriers are
> indeed present (at least when we're not in tiny mode where all
> this discussion would be moot anyway). For example, in call_rcu
> we eventually get down to rcu_segcblist_enqueue which has an smp_mb.
> On the reader side (correct me if I'm wrong Paul) the memory
> barrier is implicitly coming from the scheduler.
>
> My point is that within our kernel whenever we have a CPU memory
> barrier we always have a compiler barrier too. Therefore my code
> example above does not need any extra compiler barriers such as
> the ones provided by READ_ONCE/WRITE_ONCE.
For the garden-variety RCU use cases, this is true. Instead, they
require things that are as strong or stronger than READ_ONCE/WRITE_ONCE.
For example:
        CPU 0:
                p = kzalloc(sizeof(*p), GFP_KERNEL);
                BUG_ON(!p);
                p->a = 42;
                r1 = p->b;
                rcu_assign_pointer(gp, p);

        CPU 1:
                rcu_read_lock();
                q = rcu_dereference(gp);
                r2 = q->a;
                q->b = 202;
                rcu_read_unlock();
In this case, the plain C-language loads and stores cannot possibly
execute concurrently, and the model agrees with this. Again, we didn't
use WRITE_ONCE() and READ_ONCE() -- we instead used rcu_assign_pointer()
and rcu_dereference().
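(One way to see why: rcu_assign_pointer() is at its core a release store,
so the initialization of p->a cannot be reordered past the publication
of p. Roughly, ignoring the type checking and the constant-NULL special
case, it is:

        #define rcu_assign_pointer(p, v) smp_store_release(&(p), (v))

And rcu_dereference() provides the matching address dependency on the
reader side.)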
Similar examples involving (say) list_del_rcu() and synchronize_rcu()
are also safe for plain C-language loads and stores. Plain C-language
accesses by readers to the item being deleted cannot race with plain
C-language accesses after the synchronize_rcu() returns. But here we
are using list_del_rcu() in the update-side code along with the
synchronize_rcu() and the readers are again using rcu_dereference().
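For concreteness, that pattern usually has the following shape, where
mylock, myhead, and do_something() are made-up names for illustration:

        /* Updater */
        spin_lock(&mylock);
        list_del_rcu(&p->list);
        spin_unlock(&mylock);
        synchronize_rcu();
        p->a = 0;       /* Plain store is safe: all readers are done with p. */
        kfree(p);

        /* Reader */
        rcu_read_lock();
        list_for_each_entry_rcu(q, &myhead, list)
                do_something(q->a);     /* Plain load is likewise safe. */
        rcu_read_unlock();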
> I think Paul was perhaps thinking that I'm expecting
> rcu_read_lock/rcu_read_unlock themselves to provide the memory
> or compiler barriers. That would indeed be wrong but this is
> not what I need. All I need is the RCU semantics as documented
> for there to be memory and compiler barriers around the whole
> grace period.
Whew!!! That was -exactly- what I was thinking, and I am glad that
you are not advocating rcu_read_lock() and rcu_read_unlock() supplying
those barriers. ;-)
And again, it is early days for plain accesses in the Linux-kernel
memory model. I am confident that using READ_ONCE() and WRITE_ONCE()
for "a" is a very good thing to do, but I am less confident for "b",
most especially for CPU1's store to "b".
Either way, your example is an excellent test case for the plain
C-language access capability of the Linux-kernel memory model, so thank
you for that!
Thanx, Paul