Re: liburcu: LTO breaking rcu_dereference on arm64 and possibly other architectures ?

From: Paul E. McKenney
Date: Fri Apr 16 2021 - 15:04:04 EST


On Fri, Apr 16, 2021 at 02:40:08PM -0400, Mathieu Desnoyers wrote:
> ----- On Apr 16, 2021, at 12:01 PM, paulmck paulmck@xxxxxxxxxx wrote:
>
> > On Fri, Apr 16, 2021 at 05:17:11PM +0200, Peter Zijlstra wrote:
> >> On Fri, Apr 16, 2021 at 10:52:16AM -0400, Mathieu Desnoyers wrote:
> >> > Hi Paul, Will, Peter,
> >> >
> >> > I noticed in this discussion https://lkml.org/lkml/2021/4/16/118 that LTO
> >> > is able to break rcu_dereference. This seems to be taken care of by
> >> > arch/arm64/include/asm/rwonce.h on arm64 in the Linux kernel tree.
> >> >
> >> > In the liburcu user-space library, we have this comment near rcu_dereference()
> >> > in include/urcu/static/pointer.h:
> >> >
> >> > * The compiler memory barrier in CMM_LOAD_SHARED() ensures that value-speculative
> >> > * optimizations (e.g. VSS: Value Speculation Scheduling) does not perform the
> >> > * data read before the pointer read by speculating the value of the pointer.
> >> > * Correct ordering is ensured because the pointer is read as a volatile access.
> >> > * This acts as a global side-effect operation, which forbids reordering of
> >> > * dependent memory operations. Note that such concern about dependency-breaking
> >> > * optimizations will eventually be taken care of by the "memory_order_consume"
> >> > * addition to forthcoming C++ standard.
> >> >
> >> > (note: CMM_LOAD_SHARED() is the equivalent of READ_ONCE(), but was introduced in
> >> > liburcu as a public API before READ_ONCE() existed in the Linux kernel)
> >> >
> >> > Peter tells me the "memory_order_consume" is not something which can be used today.
> >> > Any information on its status at C/C++ standard levels and implementation-wise ?
> >
> > Actually, you really can use memory_order_consume. All current
> > implementations will compile it as if it was memory_order_acquire.
> > This will work correctly, but may be slower than you would like on ARM,
> > PowerPC, and so on.
> >
> > On things like x86, the penalty is forgone optimizations, so less
> > of a problem there.
>
> OK
>
> >
> >> > Pragmatically speaking, what should we change in liburcu to ensure we don't generate
> >> > broken code when LTO is enabled ? I suspect there are a few options here:
> >> >
> >> > 1) Fail to build if LTO is enabled,
> >> > 2) Generate slower code for rcu_dereference, either on all architectures or only
> >> >    on weakly-ordered architectures,
> >> > 3) Generate different code depending on whether LTO is enabled or not. AFAIU this would only
> >> >    work if every compile unit is aware that it will end up being optimized with LTO. Not sure
> >> >    how this could be done in the context of user-space.
> >> > 4) [ Insert better idea here. ]
> >
> > Use memory_order_consume if LTO is enabled. That will work now, and
> > might generate good code in some hoped-for future.
>
> In the context of a user-space library, how does one check whether LTO is enabled with
> preprocessor directives ? A quick test with gcc seems to show that builds with and without
> -flto cannot be distinguished from a preprocessor POV, e.g. the output of both
>
> gcc --std=c11 -O2 -dM -E - < /dev/null
> and
> gcc --std=c11 -O2 -flto -dM -E - < /dev/null
>
> is exactly the same. Am I missing something here ?

No idea. ;-)

> If we accept using memory_order_consume all the time in both C and C++ code starting from
> C11 and C++11, the following code snippet could do the trick:
>
> #define CMM_ACCESS_ONCE(x) (*(__volatile__ __typeof__(x) *)&(x))
> #define CMM_LOAD_SHARED(p) CMM_ACCESS_ONCE(p)
>
> #if defined (__cplusplus)
> # if __cplusplus >= 201103L
> #  include <atomic>
> #  define rcu_dereference(x) ((std::atomic<__typeof__(x)>)(x)).load(std::memory_order_consume)
> # else
> #  define rcu_dereference(x) CMM_LOAD_SHARED(x)
> # endif
> #else
> # if (defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L)
> #  include <stdatomic.h>
> #  define rcu_dereference(x) atomic_load_explicit(&(x), memory_order_consume)
> # else
> #  define rcu_dereference(x) CMM_LOAD_SHARED(x)
> # endif
> #endif
>
> This uses the volatile approach prior to C11/C++11, and moves to memory_order_consume
> afterwards. This will bring a performance penalty on weakly-ordered architectures even
> when -flto is not specified though.
>
> Then the burden is pushed on the compiler people to eventually implement an efficient
> memory_order_consume.
>
> Is that acceptable ?

That makes sense to me!

If it can be done reasonably, I suggest also having some way for the
person building userspace RCU to say "I know what I am doing, so do
it with volatile rather than memory_order_consume."

Thanx, Paul