Re: [PATCH] Linux: Implement membarrier function

From: Paul E. McKenney
Date: Tue Dec 11 2018 - 14:08:14 EST


On Tue, Dec 11, 2018 at 11:21:15AM -0500, Alan Stern wrote:
> On Mon, 10 Dec 2018, Paul E. McKenney wrote:
>
> > On Mon, Dec 10, 2018 at 11:22:31AM -0500, Alan Stern wrote:
> > > On Thu, 6 Dec 2018, Paul E. McKenney wrote:
> > >
> > > > Hello, David,
> > > >
> > > > I took a crack at extending LKMM to accommodate what I think would
> > > > support what you have in your paper. Please see the very end of this
> > > > email for a patch against the "dev" branch of my -rcu tree.
> > > >
> > > > This gives the expected result for the following three litmus tests,
> > > > but is probably deficient or otherwise misguided in other ways. I have
> > > > added the LKMM maintainers on CC for their amusement. ;-)
> > > >
> > > > Thoughts?
> > >
> > > Since sys_membarrier() provides a heavyweight barrier comparable to
> > > synchronize_rcu(), the memory model should treat the two in the same
> > > way. That's what this patch does.
> > >
> > > The corresponding critical section would be any region of code bounded
> > > by compiler barriers. Since the LKMM doesn't currently handle plain
> > > accesses, the effect is the same as if a compiler barrier were present
> > > between each pair of instructions. Basically, each instruction acts as
> > > its own critical section. Therefore the patch below defines memb-rscsi
> > > as the trivial identity relation. When plain accesses and compiler
> > > barriers are added to the memory model, a different definition will be
> > > needed.
> > >
> > > This gives the correct results for the three C-Goldblat-memb-* litmus
> > > tests in Paul's email.
> >
> > Yow!!!
> >
> > My first reaction was that this cannot possibly be correct because
> > sys_membarrier(), which is probably what we should call it, does not
> > wait for anything. But your formulation has the corresponding readers
> > being "id", which as you say above is just a single event.
> >
> > But what makes this work for the following litmus test?
> >
> > ------------------------------------------------------------------------
> >
> > C membrcu
> >
> > {
> > }
> >
> > P0(intptr_t *x0, intptr_t *x1)
> > {
> > 	WRITE_ONCE(*x0, 2);
> > 	smp_memb();
> > 	intptr_t r2 = READ_ONCE(*x1);
> > }
> >
> >
> > P1(intptr_t *x1, intptr_t *x2)
> > {
> > 	WRITE_ONCE(*x1, 2);
> > 	smp_memb();
> > 	intptr_t r2 = READ_ONCE(*x2);
> > }
> >
> >
> > P2(intptr_t *x2, intptr_t *x3)
> > {
> > 	WRITE_ONCE(*x2, 2);
> > 	smp_memb();
> > 	intptr_t r2 = READ_ONCE(*x3);
> > }
> >
> >
> > P3(intptr_t *x3, intptr_t *x4)
> > {
> > 	rcu_read_lock();
> > 	WRITE_ONCE(*x3, 2);
> > 	intptr_t r2 = READ_ONCE(*x4);
> > 	rcu_read_unlock();
> > }
> >
> >
> > P4(intptr_t *x4, intptr_t *x5)
> > {
> > 	rcu_read_lock();
> > 	WRITE_ONCE(*x4, 2);
> > 	intptr_t r2 = READ_ONCE(*x5);
> > 	rcu_read_unlock();
> > }
> >
> >
> > P5(intptr_t *x0, intptr_t *x5)
> > {
> > 	rcu_read_lock();
> > 	WRITE_ONCE(*x5, 2);
> > 	intptr_t r2 = READ_ONCE(*x0);
> > 	rcu_read_unlock();
> > }
> >
> > exists
> > (5:r2=0 /\ 0:r2=0 /\ 1:r2=0 /\ 2:r2=0 /\ 3:r2=0 /\ 4:r2=0)
> >
> > ------------------------------------------------------------------------
> >
> > For this, herd gives "Never". Of course, if I reverse the write and
> > read in any of P3(), P4(), or P5(), I get "Sometimes", which does make
> > sense. But what is preserving the order between P3() and P4() and
> > between P4() and P5()? I am not immediately seeing how the analogy
> > with RCU carries over to this case.
>
> That isn't how it works. Nothing preserves the orders you mentioned.
> It's more like: the order between P1 and P4 is preserved, as is the
> order between P0 and P5. You'll see below...
>
> (I readily agree that this result is not simple or obvious. It took me
> quite a while to formulate the following analysis.)

For whatever it is worth, David Goldblatt agrees with you to at
least some extent. I have sent him an inquiry. ;-)

> To begin with, since there aren't any synchronize_rcu calls in the
> test, the rcu_read_lock and rcu_read_unlock calls do nothing. They
> can be eliminated.

Agreed. I was just being lazy.

> Also, I find the variable names "x0" - "x5" to be a little hard to
> work with. If you don't mind, I'll replace them with "a" - "f".

Easy enough to translate, so have at it!

> Now, a little digression on how sys_membarrier works. It starts by
> executing a full memory barrier. Then it injects memory barriers into
> the instruction streams of all the other CPUs and waits for them all
> to complete. Then it executes an ending memory barrier.
>
> These barriers are ordered as described. Therefore we have
>
> mb0s < mb05 < mb0e,
> mb1s < mb14 < mb1e, and
> mb2s < mb23 < mb2e,
>
> where mb0s is the starting barrier of the sys_memb call on P0, mb05 is
> the barrier that it injects into P5, mb0e is the ending barrier of the
> call, and similarly for the other sys_memb calls. The '<' signs mean
> that the thing on their left finishes before the thing on their right
> does.
>
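Just to make sure I am reading this correctly, that sequence would look
roughly like the following in code form -- a sketch only, with made-up
helper names rather than the actual kernel implementation:

	static void sys_membarrier_sketch(void)
	{
		int cpu;

		smp_mb();		/* mb0s: starting barrier. */

		/* Hypothetical helpers: IPI each of the other CPUs so
		 * that it executes a full barrier (mb05 and friends),
		 * then wait for all of those injected barriers to
		 * complete. */
		for_each_other_cpu(cpu)
			ipi_execute_barrier(cpu);
		wait_for_injected_barriers();

		smp_mb();		/* mb0e: ending barrier. */
	}

so that mb0s < mb05 < mb0e is just the order in which those three steps
complete.
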
> Rewriting the litmus test in these terms gives:
>
> > P0      P1      P2      P3      P4      P5
> > Wa=2    Wb=2    Wc=2    [mb23]  [mb14]  [mb05]
> > mb0s    mb1s    mb2s    Wd=2    We=2    Wf=2
> > mb0e    mb1e    mb2e    Re=0    Rf=0    Ra=0
> > Rb=0    Rc=0    Rd=0
>
> Here the brackets in "[mb23]", "[mb14]", and "[mb05]" mean that the
> positions of these barriers in their respective threads' program
> orderings is undetermined; they need not come at the top as shown.
>
> (Also, in case David is unfamiliar with it, the "Wa=2" notation is
> shorthand for "Write 2 to a" and "Rb=0" is short for "Read 0 from b".)
>
> Finally, here are a few facts which may be well known and obvious, but
> I'll state them anyway:
>
> A CPU cannot reorder instructions across a memory barrier.
> If x is po-after a barrier then x executes after the barrier
> is finished.
>
> If a store is po-before a barrier then the store propagates
> to every CPU before the barrier finishes.
>
> If a store propagates to some CPU before a load on that CPU
> reads from the same location, then the load will obtain the
> value from that store or a co-later store. This implies that
> if a load obtains a value co-earlier than some store then the
> load must have executed before the store propagated to the
> load's CPU.
>
> The proof consists of three main stages, each requiring three steps.
> Using the facts that b - f are all read as 0, I'll show that P1
> executes Rc before P3 executes Re, then that P0 executes Rb before P4
> executes Rf, and lastly that P5's Ra must obtain 2, not 0. This will
> demonstrate that the litmus test is not allowed.
>
> 1. Suppose that mb23 ends up coming po-later than Wd in P3.
> Then we would have:
>
> Wd propagates to P2 < mb23 < mb2e < Rd,
>
> and so Rd would obtain 2, not 0. Hence mb23 must come
> po-before Wd (as shown in the listing): mb23 < Wd.
>
> 2. Since mb23 therefore occurs po-before Re and instructions
> cannot be reordered across barriers, mb23 < Re.
>
> 3. Since Rc obtains 0, we must have:
>
> Rc < Wc propagates to P1 < mb2s < mb23 < Re.
>
> Thus Rc < Re.
>
> 4. Suppose that mb14 ends up coming po-later than We in P4.
> Then we would have:
>
> We propagates to P3 < mb14 < mb1e < Rc < Re,
>
> and so Re would obtain 2, not 0. Hence mb14 must come
> po-before We (as shown in the listing): mb14 < We.
>
> 5. Since mb14 therefore occurs po-before Rf and instructions
> cannot be reordered across barriers, mb14 < Rf.
>
> 6. Since Rb obtains 0, we must have:
>
> Rb < Wb propagates to P0 < mb1s < mb14 < Rf.
>
> Thus Rb < Rf.
>
> 7. Suppose that mb05 ends up coming po-later than Wf in P5.
> Then we would have:
>
> Wf propagates to P4 < mb05 < mb0e < Rb < Rf,
>
> and so Rf would obtain 2, not 0. Hence mb05 must come
> po-before Wf (as shown in the listing): mb05 < Wf.
>
> 8. Since mb05 therefore occurs po-before Ra and instructions
> cannot be reordered across barriers, mb05 < Ra.
>
> 9. Now we have:
>
> Wa propagates to P5 < mb0s < mb05 < Ra,
>
> and so Ra must obtain 2, not 0. QED.

Like this, then, with maximal reordering of P3-P5's reads?

P0      P1      P2      P3      P4      P5
Wa=2
mb0s
                                        [mb05]
mb0e                                    Ra=0
Rb=0    Wb=2
        mb1s
                                [mb14]
        mb1e                    Rf=0
        Rc=0    Wc=2                    Wf=2
                mb2s
                        [mb23]
                mb2e    Re=0
                Rd=0            We=2
                        Wd=2

But don't the sys_membarrier() calls affect everyone, especially given
the shared-variable communication? If so, why wouldn't this stricter
variant hold?

P0      P1      P2      P3      P4      P5
Wa=2
mb0s
                        [mb05]  [mb05]  [mb05]
mb0e
Rb=0    Wb=2
        mb1s
                        [mb14]  [mb14]  [mb14]
        mb1e
        Rc=0    Wc=2
                mb2s
                        [mb23]  [mb23]  [mb23]
                mb2e    Re=0    Rf=0    Ra=0
                Rd=0            We=2    Wf=2
                        Wd=2

In which case, wouldn't this cycle be forbidden even if it had only one
sys_membarrier() call?
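
For concreteness, the single-sys_membarrier() variant I have in mind is
the sketch below: the same cycle, but with smp_memb() only in P0 and
with the no-op rcu_read_lock()/rcu_read_unlock() pairs dropped. (I have
not actually run this one through herd.)

C membrcu-single-memb

{
}

P0(intptr_t *x0, intptr_t *x1)
{
	WRITE_ONCE(*x0, 2);
	smp_memb();
	intptr_t r2 = READ_ONCE(*x1);
}

P1(intptr_t *x1, intptr_t *x2)
{
	WRITE_ONCE(*x1, 2);
	intptr_t r2 = READ_ONCE(*x2);
}

P2(intptr_t *x2, intptr_t *x3)
{
	WRITE_ONCE(*x2, 2);
	intptr_t r2 = READ_ONCE(*x3);
}

P3(intptr_t *x3, intptr_t *x4)
{
	WRITE_ONCE(*x3, 2);
	intptr_t r2 = READ_ONCE(*x4);
}

P4(intptr_t *x4, intptr_t *x5)
{
	WRITE_ONCE(*x4, 2);
	intptr_t r2 = READ_ONCE(*x5);
}

P5(intptr_t *x0, intptr_t *x5)
{
	WRITE_ONCE(*x5, 2);
	intptr_t r2 = READ_ONCE(*x0);
}

exists
(5:r2=0 /\ 0:r2=0 /\ 1:r2=0 /\ 2:r2=0 /\ 3:r2=0 /\ 4:r2=0)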

Ah, but the IPIs are not necessarily synchronized across the CPUs,
so that the following could happen:

P0      P1      P2      P3      P4      P5
Wa=2
mb0s
                        [mb05]  [mb05]  [mb05]
mb0e                                    Ra=0
Rb=0    Wb=2
        mb1s
                        [mb14]          [mb14]
                                Rf=0
                                        Wf=2
                                [mb14]
        mb1e
        Rc=0    Wc=2
                mb2s
                                        [mb23]
                        Re=0
                                We=2
                        [mb23]  [mb23]
                mb2e
                Rd=0
                        Wd=2

I guess in light of this post in 2001, I really don't have an excuse,
do I? ;-)

https://lists.gt.net/linux/kernel/223555

Or am I still missing something here?

Thanx, Paul