Re: [PATCH -tip 2/3] sched/wake_q: Relax to acquire semantics

From: Martin Schwidefsky
Date: Mon Sep 21 2015 - 05:23:08 EST


On Fri, 18 Sep 2015 14:41:20 -0700
"Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx> wrote:

> On Tue, Sep 15, 2015 at 10:09:41AM -0700, Paul E. McKenney wrote:
> > On Tue, Sep 15, 2015 at 06:30:28PM +0200, Peter Zijlstra wrote:
> > > On Tue, Sep 15, 2015 at 08:34:48AM -0700, Paul E. McKenney wrote:
> > > > On Tue, Sep 15, 2015 at 04:14:39PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Sep 15, 2015 at 07:09:22AM -0700, Paul E. McKenney wrote:
> > > > > > On Tue, Sep 15, 2015 at 02:48:00PM +0200, Peter Zijlstra wrote:
> > > > > > > On Tue, Sep 15, 2015 at 05:41:42AM -0700, Paul E. McKenney wrote:
> > > > > > > > > Never mind, the PPC people will implement this with lwsync and that is
> > > > > > > > > very much not transitive IIRC.
> > > > > > > >
> > > > > > > > I am probably lost on context, but...
> > > > > > > >
> > > > > > > > It turns out that lwsync is transitive in special cases. One of them
> > > > > > > > is a series of release-acquire pairs, which can extend indefinitely.
> > > > > > > >
> > > > > > > > Does that help in this case?
> > > > > > >
> > > > > > > Probably not, but good to know. I still don't think we want to rely on
> > > > > > > ACQUIRE/RELEASE being transitive in general though.
> > > > > >
> > > > > > OK, I will bite... Why not?
> > > > >
> > > > > It would mean us reviewing all archs (again) and documenting it I
> > > > > suppose. Which is of course entirely possible.
> > > > >
> > > > > > That said, I don't think the case at hand requires it, so let's postpone
> > > > > this for now ;-)
> > > >
> > > > True enough, but in my experience smp_store_release() and
> > > > smp_load_acquire() are a -lot- easier to use than other barriers,
> > > > and transitivity will help promote their use. So...
> > > >
> > > > All the TSO architectures (x86, s390, SPARC, HPPA, ...) support transitive
> > > > smp_store_release()/smp_load_acquire() via their native ordering in
> > > > combination with barrier() macros. x86 with CONFIG_X86_PPRO_FENCE=y,
> > > > which is not TSO, uses an mfence instruction. Power supports this via
> > > > lwsync's partial cumulativity. ARM64 supports it in SMP via the new ldar
> > > > and stlr instructions (in non-SMP, it uses barrier(), which suffices
> > > > in that case). IA64 supports this via total ordering of all release
> > > > instructions in theory and by the actual full-barrier implementation
> > > > in practice (and the fact that gcc emits st.rel and ld.acq instructions
> > > > for volatile stores and loads). All other architectures use smp_mb(),
> > > > which is transitive.
> > > >
> > > > Did I miss anything?
> > >
> > > I think that about covers it.. the only odd duckling might be s390 which
> > > is documented as TSO but recently grew smp_mb__{before,after}_atomic(),
> > > which seems to confuse matters.
> >
> > Fair point, adding Martin and Heiko on CC for their thoughts.

Well, we have always had a full memory barrier for the various versions of
smp_mb__xxx; they have just been moved around and renamed several times.

After discussing this with Heiko we came to the conclusion that we can use
a simple barrier() for smp_mb__before_atomic() and smp_mb__after_atomic().
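
As a rough illustration of what that would mean: barrier() is a
compiler-only barrier (on GCC, an empty asm with a "memory" clobber), so
the two macros would stop emitting any instruction and only prevent the
compiler from reordering memory accesses across them. The sketch below is
a userspace approximation under that assumption, not the actual s390
patch; demo_counter and demo_inc() are hypothetical names, and the C11
atomic_fetch_add() merely stands in for a kernel atomic operation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Compiler-only barrier: emits no instruction, forbids compiler reordering. */
#define barrier() __asm__ __volatile__("" : : : "memory")

/* Assumed mapping discussed above; the real arch header may differ. */
#define smp_mb__before_atomic() barrier()
#define smp_mb__after_atomic()  barrier()

static atomic_int demo_counter;   /* hypothetical shared counter */

/* Increment the counter between the two barriers; return the new value. */
static int demo_inc(void)
{
    int old;

    smp_mb__before_atomic();
    old = atomic_fetch_add_explicit(&demo_counter, 1, memory_order_relaxed);
    smp_mb__after_atomic();
    return old + 1;
}
```

Whether a compiler barrier is enough depends entirely on the hardware
atomic instruction already providing the required ordering, which is the
conclusion stated above for s390 and should not be assumed elsewhere.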

> > It looks like this applies to recent mainframes that have new atomic
> > instructions, which, yes, might need something to make them work with
> > fully transitive smp_load_acquire() and smp_store_release().
> >
> > Martin, Heiko, the question is whether or not the current s390
> > smp_store_release() and smp_load_acquire() can be transitive.
> > For example, if all the Xi variables below are initially zero,
> > is it possible for all the r0, r1, r2, ... rN variables to
> > have the value 1 at the end of the test?
>
> Right... This time actually adding Martin and Heiko on CC...
>
> Thanx, Paul
>
> > CPU 0
> > r0 = smp_load_acquire(&X0);
> > smp_store_release(&X1, 1);
> >
> > CPU 1
> > r1 = smp_load_acquire(&X1);
> > smp_store_release(&X2, 1);
> >
> > CPU 2
> > r2 = smp_load_acquire(&X2);
> > smp_store_release(&X3, 1);
> >
> > ...
> >
> > CPU N
> > rN = smp_load_acquire(&XN);
> > smp_store_release(&X0, 1);
> >
> > If smp_store_release() and smp_load_acquire() are transitive, the
> > answer would be "no".

The answer is "no". Christian recently summarized what the Principles of
Operation has to say about CPU read/write behavior. If you consider
the sequential order of instructions, then

1) reads are in order
2) writes are in order
3) reads can happen earlier
4) writes can happen later
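
For anyone who wants to poke at the first litmus test outside the kernel,
here is a minimal userspace sketch. It assumes C11 acquire loads and
release stores are an acceptable stand-in for smp_load_acquire() and
smp_store_release(), uses pthreads in place of CPUs, and fixes the ring
at four threads; NCPUS, cpu() and ring_litmus_forbidden_count() are
hypothetical names. C11 forbids the all-ones outcome because it would
need a cycle in happens-before, so the count should stay at zero. A run
like this can only fail to find a violation; it does not prove anything
about the hardware.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define NCPUS 4                  /* hypothetical ring size: CPU 0 .. CPU 3 */

static atomic_int X[NCPUS];      /* X[i] stands in for Xi */
static int r[NCPUS];             /* r[i] stands in for ri */

/* Thread i plays "CPU i": acquire-load X[i], then release-store X[i+1]. */
static void *cpu(void *arg)
{
    int i = (int)(intptr_t)arg;

    r[i] = atomic_load_explicit(&X[i], memory_order_acquire);
    /* The last CPU wraps around and stores to X[0]. */
    atomic_store_explicit(&X[(i + 1) % NCPUS], 1, memory_order_release);
    return NULL;
}

/* Run the test `rounds` times; return how often every r[i] ended up 1. */
static int ring_litmus_forbidden_count(int rounds)
{
    int forbidden = 0;

    for (int n = 0; n < rounds; n++) {
        pthread_t t[NCPUS];
        int all_ones = 1;

        for (int i = 0; i < NCPUS; i++)
            atomic_store(&X[i], 0);
        for (int i = 0; i < NCPUS; i++)
            if (pthread_create(&t[i], NULL, cpu, (void *)(intptr_t)i))
                abort();
        for (int i = 0; i < NCPUS; i++)
            pthread_join(t[i], NULL);
        for (int i = 0; i < NCPUS; i++)
            all_ones &= (r[i] == 1);
        forbidden += all_ones;
    }
    return forbidden;
}
```

Calling ring_litmus_forbidden_count(200) from a small main() and
checking that it returns 0 is enough to exercise it.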

> > A similar litmus test involving atomics would be as follows, again
> > with all Xi initially zero:
> >
> > CPU 0
> > atomic_inc(&X0);
> > smp_store_release(&X1, 1);
> >
> > CPU 1
> > r1 = smp_load_acquire(&X1);
> > smp_store_release(&X2, 1);
> >
> > CPU 2
> > r2 = smp_load_acquire(&X2);
> > smp_store_release(&X3, 1);
> >
> > ...
> >
> > CPU N
> > rN = smp_load_acquire(&XN);
> > r0 = atomic_read(&X0);
> >
> > Here, the question is whether r0 can be zero, but r1, r2, ... rN all
> > being 1 at the end of the test.

r0 = 0 and all r1, r2, ... rN = 1 cannot happen on s390.
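
The atomic variant can be sketched in userspace the same way. The
assumptions here are that a relaxed C11 fetch-add approximates
atomic_inc(), a relaxed load approximates atomic_read(), and four
pthreads stand in for CPU 0 through CPU N; NCHAIN, chain_cpu() and
chain_litmus_forbidden_count() are hypothetical names. Under the C11
model the release-acquire chain orders the increment before the final
read, so the forbidden outcome should never be counted.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define NCHAIN 4                 /* hypothetical: CPU 0 .. CPU N with N = 3 */

static atomic_int Y[NCHAIN];     /* Y[i] stands in for Xi */
static int obs[NCHAIN];          /* obs[i] stands in for ri */

static void *chain_cpu(void *arg)
{
    int i = (int)(intptr_t)arg;

    if (i == 0) {
        /* CPU 0: atomic_inc(&X0), then release-store X1. */
        atomic_fetch_add_explicit(&Y[0], 1, memory_order_relaxed);
        atomic_store_explicit(&Y[1], 1, memory_order_release);
    } else {
        obs[i] = atomic_load_explicit(&Y[i], memory_order_acquire);
        if (i < NCHAIN - 1) {
            /* Middle CPUs pass the token on. */
            atomic_store_explicit(&Y[i + 1], 1, memory_order_release);
        } else {
            /* CPU N: r0 = atomic_read(&X0). */
            obs[0] = atomic_load_explicit(&Y[0], memory_order_relaxed);
        }
    }
    return NULL;
}

/* Count rounds where r0 == 0 although r1 .. rN are all 1. */
static int chain_litmus_forbidden_count(int rounds)
{
    int forbidden = 0;

    for (int n = 0; n < rounds; n++) {
        pthread_t t[NCHAIN];
        int chain_seen = 1;

        for (int i = 0; i < NCHAIN; i++) {
            atomic_store(&Y[i], 0);
            obs[i] = 0;
        }
        for (int i = 0; i < NCHAIN; i++)
            if (pthread_create(&t[i], NULL, chain_cpu, (void *)(intptr_t)i))
                abort();
        for (int i = 0; i < NCHAIN; i++)
            pthread_join(t[i], NULL);
        for (int i = 1; i < NCHAIN; i++)
            chain_seen &= (obs[i] == 1);
        forbidden += (chain_seen && obs[0] == 0);
    }
    return forbidden;
}
```

As with any litmus run, a zero count is consistent with the claim but is
not a proof of it.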

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.
