Re: [GIT PULL rcu/next] RCU commits for 4.13

From: Paul E. McKenney
Date: Wed Jun 28 2017 - 13:03:43 EST


On Wed, Jun 28, 2017 at 11:31:55AM -0400, Alan Stern wrote:
> On Tue, 27 Jun 2017, Paul E. McKenney wrote:
>
> > On Tue, Jun 27, 2017 at 02:48:18PM -0700, Linus Torvalds wrote:
> > > On Tue, Jun 27, 2017 at 1:58 PM, Paul E. McKenney
> > > <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > So what next?
> > > >
> > > > One option would be to weaken the definition of spin_unlock_wait() so
> > > > that it had acquire semantics but not release semantics. Alternatively,
> > > > we could keep the full empty-critical-section semantics and add memory
> > > > barriers to spin_unlock_wait() implementations, perhaps as shown in the
> > > > patch below. I could go either way, though I do have some preference
> > > > for the stronger semantics.
> > > >
> > > > Thoughts?
> > >
> > > I would prefer to just say
> > >
> > > - document that spin_unlock_wait() has acquire semantics
> > >
> > > - mindlessly add the smp_mb() to all users
> > >
> > > - let users then decide if they are ok with just acquire
> > >
> > > That's partly because I think we actually have *fewer* users than we
> > > have implementations of spin_unlock_wait(). So adding a few smp_mb()'s
> > > in the users is actually likely the smaller change.
> >
> > You are right about that! There are only five invocations of
> > spin_unlock_wait() in the kernel, with a sixth that has since been
> > converted to spin_lock() immediately followed by spin_unlock().
> >
> > > But it's also because then that allows people who *can* say that
> > > acquire is sufficient to just use it. People who use
> > > spin_unlock_wait() tend to have some odd performance reason to do so,
> > > so I think allowing them to use the more light-weight memory ordering
> > > if it works for them is a good idea.
> > >
> > > But finally, it's partly because I think "acquire" semantics are
> > > actually the saner ones that we can explain the logic for much more
> > > clearly.
> > >
> > > Basically, acquire semantics means that you are guaranteed to see any
> > > changes that were done inside a previously locked region.
> > >
> > > Doesn't that sound like sensible semantics?
> >
> > It is the semantics that most implementations of spin_unlock_wait()
> > provide. Of the six invocations, two of them very clearly rely
> > only on the acquire semantics and two others already have the needed
> > memory barriers in place. I have queued one patch to add smp_mb()
> > to the remaining spin_unlock_wait() of the surviving five instances,
> > and another patch to convert the spin_lock/unlock pair to smp_mb()
> > followed by spin_unlock_wait().
> >
> > So, yes, it is a sensible set of semantics. At this point, agreeing
> > -any- reasonable semantics would be good, as it would allow us to get
> > locking added to the prototype Linux-kernel memory model. ;-)
> >
> > > Then, the argument for "smp_mb()" (before the spin_unlock_wait()) becomes:
> > >
> > > - I did a write that will affect any future lock takes
> > >
> > > - the smp_mb() now means that that write will be ordered wrt the
> > > acquire that guarantees we've seen all old actions taken by a lock.
> > >
> > > Does those kinds of semantics make sense to people?
>
> The problem is that adding smp_mb() before spin_unlock_wait() does not
> provide release semantics, as Andrea has pointed out. ISTM that when
> people want full release & acquire semantics, they should just use
> "spin_lock(); spin_unlock();".

Well, from what I can see, this thread is certainly demonstrating to my
satisfaction that we really do need a memory model. ;-)

I agree that calling a loops of loads a "release" is at best confusing,
and certainly conflicts with all know nomenclature. So let's forget
"release" or not and try to agree on litmus tests. Of course, as we
both know and as noted long ago on LKML, we cannot -define- semantics
via litmus tests, but we can use litmus tests to -check- semantics.

First, modeling lock acquisition with xchg(), which is fully ordered
(given that locking is still very much under development in our model):

C SUW+or-ow+l-ow-or
{}

P0(int *x, int *y, int *my_lock)
{
r0 = READ_ONCE(*x);
WRITE_ONCE(*y, 1);
smp_mb();
r1 = READ_ONCE(*my_lock);
}

P1(int *x, int *y, int *my_lock) {
r1 = xchg(my_lock, 1);
WRITE_ONCE(*x, 1);
r0 = READ_ONCE(*y);
}

exists (0:r0=1 /\ 0:r1=0 /\ 1:r0=0 /\ 1:r1=0)

This gives "Positive: 0 Negative: 5". But to your point, replacing
"xchg()" with "xchg_acquire()" gets us "Positive: 1 Negative: 7".

But xchg_acquire() would in fact work on x86 because the xchg instruction
does full ordering in any case, right? (I believe that this also applies
to the other TSO architectures, but have not checked.)

For PowerPC (and I believe ARM), the leading smp_mb() suffices, as
illustrated by this litmus test:

PPC spin_unlock_wait_st.litmus
""
{
l=0;
0:r1=1; 0:r3=42; 0:r4=x; 0:r10=0; 0:r12=l;
1:r1=1; 1:r3=42; 1:r4=x; 1:r10=0; 1:r11=0; 1:r12=l;
}
P0 | P1 ;
stw r1,0(r4) | lwarx r11,r10,r12 ;
sync | cmpwi r11,0 ;
lwarx r11,r10,r12 | bne Fail1 ;
stwcx. r11,r10,r12 | stwcx. r1,r10,r12 ;
bne Fail0 | bne Fail1 ;
lwz r3,0(r12) | lwsync ;
Fail0: | lwz r3,0(r4) ;
| Fail1: ;


exists
(0:r3=0 /\ 1:r3=0)

So what am I missing here?

> If there are any places where this would add unacceptable overhead,
> maybe those places need some rethinking. For instance, perhaps we
> could add a separate primitive that provides only release semantics.
> (But would using the new primitive together with spin_unlock_wait
> really be significantly better than lock-unlock?)

At the moment, I have no idea on the relative overheads. ;-)

Thanx, Paul