Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)

From: Mathieu Desnoyers
Date: Tue Mar 17 2015 - 09:13:43 EST


----- Original Message -----
> On Tue, Mar 17, 2015 at 01:45:25AM +0000, Mathieu Desnoyers wrote:
> > Let's go through a memory ordering scenario to highlight my reasoning
> > there.
> >
> > Let's consider the following memory barrier scenario performed in
> > user-space on an architecture with very relaxed ordering. PowerPC comes
> > to mind.
> >
> > https://lwn.net/Articles/573436/
> > scenario 12:
> >
> > CPU 0                    CPU 1
> > CAO(x) = 1;              r3 = CAO(y);
> > cmm_smp_wmb();           cmm_smp_rmb();
> > CAO(y) = 1;              r4 = CAO(x);
> >
> > BUG_ON(r3 == 1 && r4 == 0)
>
> WTF is CAO() ? and that ridiculous cmm_ prefix on the barriers.

In the LWN article, for the sake of conciseness, CAO is an alias
to the Linux kernel ACCESS_ONCE(). It maps to CMM_LOAD_SHARED(),
CMM_STORE_SHARED() in userspace RCU.

The cmm_ prefix means "Concurrent Memory Model", which is a prefix
used for memory access/ordering related headers in userspace RCU.
Because it is a userspace library, we need those prefixes; otherwise
we would clash with symbols in pre-existing applications.

>
> > We tweak it to use sys_membarrier on CPU 1, and a simple compiler
> > barrier() on CPU 0:
> >
> > CPU 0                    CPU 1
> > CAO(x) = 1;              r3 = CAO(y);
> > barrier();               sys_membarrier();
> > CAO(y) = 1;              r4 = CAO(x);
> >
> > BUG_ON(r3 == 1 && r4 == 0)
>
> That hardly seems like a valid substitution; barrier() is not a valid
> replacement of a memory barrier is it? Esp not on PPC.

That's the whole point of sys_membarrier. Quoting the tl;dr changelog:

"It can be used to distribute the cost of user-space memory barriers
asymmetrically by transforming pairs of memory barriers into pairs
consisting of sys_membarrier() and a compiler barrier. For
synchronization primitives that distinguish between read-side and
write-side (e.g. userspace RCU, rwlocks), the read-side can be
accelerated significantly by moving the bulk of the memory barrier
overhead to the write-side."

So basically, for a given memory barrier pair, we can turn one barrier
into a sys_membarrier(), which allows us to turn the other barrier into
a simple compiler barrier. Whenever the thread issuing sys_membarrier()
actually cares about ordering with the matching barrier,
sys_membarrier() forces all other threads of the process to issue a
memory barrier, which, at that point in time, promotes their program
order into actual memory ordering.

>
> > Now if CPU 1 executes sys_membarrier while CPU 0 is preempted after both
> > stores, we have:
> >
> > CPU 0                         CPU 1
> > CAO(x) = 1;
> > [1st store is slow to
> >  reach other cores]
> > CAO(y) = 1;
> > [2nd store reaches other
> >  cores more quickly]
> > [preempted]
> >                               r3 = CAO(y)
> >                               (may see y = 1)
> >                               sys_membarrier()
> >                               Scheduler changes rq->curr,
> >                               skips CPU 0, because rq->curr
> >                               has been updated.
> >                               [return to userspace]
> >                               r4 = CAO(x)
> >                               (may see x = 0)
> >                               BUG_ON(r3 == 1 && r4 == 0) -> fails.
> > load_cr3, with implied
> > memory barrier, comes
> > after CPU 1 has read "x".
> >
> > The only way to make this scenario work is if a memory barrier is added
> > before updating rq->curr. (we could also do a similar scenario for the
> > needed barrier after store to rq->curr).
>
> Hmmm.. like that. Light begins to dawn.
>
> So I think in this case we're good with the smp_mb__before_spinlock() we
> have; but do note its not a full MB even though the name says so.
>
> Its basically: WMB + ACQUIRE, which theoretically can leak a read in,
> but nobody sane _delays_ reads, you want to speculate reads, not
> postpone.

If the memory ordering table at
https://en.wikipedia.org/wiki/Memory_ordering is to be believed,
there appear to be quite a few architectures that can reorder loads
after loads, and loads after stores: Alpha, ARMv7, PA-RISC, SPARC RMO,
x86 oostore and ia64. There may be subtle details that would allow us
to do without the barriers in specific situations, but for that I'd
very much like to hear what Paul has to say.

>
> Also, it lacks the transitive property.

The lack of transitivity would likely be an issue
if we want to make this generally useful.

>
> > Would you see it as acceptable if we start by implementing
> > only the non-expedited sys_membarrier() ?
>
> Sure.
>
> > Then we can add
> > the expedited-private implementation after rq->curr becomes
> > available through RCU.
>
> Yeah, or not at all; I'm still trying to get Paul to remove the
> expedited nonsense from the kernel RCU bits; and now you want it in
> userspace too :/

The non-expedited case makes sense when we batch RCU work with
call_rcu(). However, some users need to call synchronize_rcu()
directly after modifying their data structure. The latency of
sys_membarrier() then becomes important to them, hence the interest
in an expedited scheme.

I agree that we should try to find a way to implement it with
low disturbance on the CPUs' rq locks. I'd certainly be OK with
starting with just the non-expedited scheme and adding the
expedited scheme later on. This is why we have the flags
argument anyway.

Thanks!

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com