Re: [RFC][PATCH 0/5] arch: atomic rework

From: Torvald Riegel
Date: Thu Feb 06 2014 - 16:10:29 EST


On Thu, 2014-02-06 at 18:59 +0000, Will Deacon wrote:
> On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote:
> > On 02/06/14 18:25, David Howells wrote:
> > >
> > > Is it worth considering a move towards using C11 atomics and barriers and
> > > compiler intrinsics inside the kernel? The compiler _ought_ to be able to do
> > > these.
> >
> >
> > It sounds interesting to me, if we can make it work properly and
> > reliably. + gcc@xxxxxxxxxxx for others in the GCC community to chip in.
>
> Given my (albeit limited) experience playing with the C11 spec and GCC, I
> really think this is a bad idea for the kernel.

I'm not going to comment on what's best for the kernel (simply because I
don't work on it), but I disagree with several of your statements.

> It seems that nobody really
> agrees on exactly how the C11 atomics map to real architectural
> instructions on anything but the trivial architectures.

There are certainly different ways to implement the memory model, and
those have to be specified elsewhere, but I don't see how this differs
much from other things specified in the ABI(s) for each architecture.

> For example, should
> the following code fire the assert?

I don't see how your example (which is about what the language requires
or not) relates to the statement about the mapping above?

>
> extern atomic<int> foo, bar, baz;
>
> void thread1(void)
> {
>     foo.store(42, memory_order_relaxed);
>     bar.fetch_add(1, memory_order_seq_cst);
>     baz.store(42, memory_order_relaxed);
> }
>
> void thread2(void)
> {
>     while (baz.load(memory_order_seq_cst) != 42) {
>         /* do nothing */
>     }
>
>     assert(foo.load(memory_order_seq_cst) == 42);
> }
>

It's a good example. My first gut feeling was that the assertion should
never fire, but that was wrong: as I seem to usually forget, the seq-cst
total order is just an ordering constraint and doesn't itself contribute
to synchronizes-with edges. Seq-cst fences are different in this respect.

> To answer that question, you need to go and look at the definitions of
> synchronises-with, happens-before, dependency_ordered_before and a whole
> pile of vaguely written waffle to realise that you don't know.

Are you familiar with the formalization of the C11/C++11 model by Batty
et al.?
http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
http://www.cl.cam.ac.uk/~mjb220/n3132.pdf

They also have a nice tool that can run condensed examples and show you
all allowed (and forbidden) executions, including nice annotated graphs
for those (it runs in the browser, so it is slow for larger examples):
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

It requires somewhat special syntax, but the following, which should be
equivalent to your example above, runs just fine:

int main() {
  atomic_int foo = 0;
  atomic_int bar = 0;
  atomic_int baz = 0;
  {{{ {
        foo.store(42, memory_order_relaxed);
        bar.store(1, memory_order_seq_cst);
        baz.store(42, memory_order_relaxed);
      }
  ||| {
        r1 = baz.load(memory_order_seq_cst).readsvalue(42);
        r2 = foo.load(memory_order_seq_cst).readsvalue(0);
      }
  }}};
  return 0;
}

That yields 3 consistent executions for me, and likewise if the last
readsvalue() uses 42 as its argument.

If you add a "fence(memory_order_seq_cst);" after the store to foo, the
program can't observe != 42 for foo anymore, because the seq-cst fence
adds a synchronizes-with edge via the baz reads-from.
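
To make that concrete, here is a sketch of what the modified example
looks like in plain C11; this is just my transcription of Will's example
into C11 syntax, with the seq-cst fence added after the store to foo:

#include <assert.h>
#include <stdatomic.h>

atomic_int foo, bar, baz;

void thread1(void)
{
    atomic_store_explicit(&foo, 42, memory_order_relaxed);
    /* A seq-cst fence is also a release fence: once thread2's seq-cst
     * (and thus acquire) load of baz reads the 42 stored below, the
     * fence synchronizes with that load, so the store to foo
     * happens-before the load of foo in thread2. */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_fetch_add_explicit(&bar, 1, memory_order_seq_cst);
    atomic_store_explicit(&baz, 42, memory_order_relaxed);
}

void thread2(void)
{
    while (atomic_load_explicit(&baz, memory_order_seq_cst) != 42) {
        /* do nothing */
    }
    /* With the fence in place, this cannot fire anymore. */
    assert(atomic_load_explicit(&foo, memory_order_seq_cst) == 42);
}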

I think cppmem is a really neat tool, and very helpful for answering
questions like the one in your example.

> Certainly,
> the code that arm64 GCC currently spits out would allow the assertion to fire
> on some microarchitectures.
>
> There are also so many ways to blow your head off it's untrue. For example,
> cmpxchg takes a separate memory model parameter for failure and success, but
> then there are restrictions on the sets you can use for each.

That's in there for the architectures without a single-instruction
CAS/cmpxchg, I believe.
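
Just to illustrate the shape of the interface (this is a sketch using a
hypothetical refcounting helper, not kernel code): the C11 generic
functions take both orders explicitly, and if I remember the standard
correctly the failure order must not be memory_order_release or
memory_order_acq_rel, and must be no stronger than the success order.

#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical "increment refcount unless it is zero", just to show
 * the two memory_order arguments of a C11 compare-exchange. */
static bool get_ref(atomic_int *refs)
{
    int old = atomic_load_explicit(refs, memory_order_relaxed);

    do {
        if (old == 0)
            return false;
        /* success order: acq_rel; failure order: relaxed. */
    } while (!atomic_compare_exchange_weak_explicit(refs, &old, old + 1,
                                                    memory_order_acq_rel,
                                                    memory_order_relaxed));
    return true;
}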

> It's not hard
> to find well-known memory-ordering experts shouting "Just use
> memory_model_seq_cst for everything, it's too hard otherwise".

Everyone I've heard say this meant it as advice to people who are new
to synchronization or deal with it only infrequently. The advice is the
simple and safe fallback, and I don't think it's meant as an
acknowledgment that the model itself is too hard. If the language's
memory model is supposed to represent weak HW memory models to at least
some extent, there's only so much you can do in terms of keeping it
simple. If all architectures had x86-like models, the language's model
would certainly be simpler... :)

> Then there's
> the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume
> atm and optimises all of the data dependencies away)

AFAIK consume memory order was added to model Power/ARM-specific
behavior. I agree that the way the standard specifies how dependencies
are to be preserved is kind of vague (as far as I understand it). See
GCC PR 59448.
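
For context, the kind of code consume is aimed at looks roughly like
the sketch below (hypothetical names, RCU-style pointer publication);
the intent is that the data dependency from the pointer load to the
dereference provides the ordering on Power/ARM, without the cost of an
acquire barrier:

#include <stdatomic.h>
#include <stddef.h>

struct msg { int payload; };

_Atomic(struct msg *) published;   /* set by a producer thread */

/* Consumer: the data dependency from the consume load of 'published'
 * to the dereference of 'p' is what is supposed to order the read of
 * p->payload after the pointer load. */
int read_payload(void)
{
    struct msg *p = atomic_load_explicit(&published, memory_order_consume);

    if (p == NULL)
        return -1;
    return p->payload;
}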

> as well as the definition
> of "data races", which seem to be used as an excuse to miscompile a program
> at the earliest opportunity.

No. The purpose of this is precisely to *not* disallow optimizations
on non-synchronizing code. Due to the assumption of data-race-free
programs, the compiler can assume a sequential code sequence when no
atomics are involved (and thus keep applying the usual optimizations
for sequential code).
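
A small, hypothetical illustration of what that buys the compiler (not
from any real codebase): in the function below, x is ordinary non-atomic
data, so the compiler may assume no other thread writes it concurrently
(that would be a data race and thus undefined behaviour) and can treat
the loop body as sequential code; the atomic flag does not get that
treatment.

#include <stdatomic.h>

int x;                 /* plain, non-atomic data */
atomic_int flag;       /* used for synchronization */

int sum_until_flag(void)
{
    int sum = 0;

    /* Because a concurrent, unsynchronized write to x would be a data
     * race (undefined behaviour), the compiler is free to load x once
     * and keep it in a register across iterations.  The atomic load
     * of 'flag' has to be performed on every iteration. */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        sum += x;

    return sum;
}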

Or is there something particular that you dislike about the
specification of data races?

> Trying to introduce system concepts (writes to devices, interrupts,
> non-coherent agents) into this mess is going to be an uphill battle IMHO.

That might very well be true.

OTOH, if you would need to model this uniformly across different
architectures (i.e., so that there is an intra-kernel-portable
abstraction for those system concepts), you might as well try doing this
by extending the C11/C++11 model. Maybe that will not be successful or
not really a good fit, but at least then it would be clear why that's
the case.

> I'd
> just rather stick to the semantics we have and the asm volatile barriers.
>
> That's not to say there's no room for improvement in what we have
> in the kernel. Certainly, I'd welcome allowing more relaxed operations on
> architectures that support them, but it needs to be something that at least
> the different architecture maintainers can understand how to implement
> efficiently behind an uncomplicated interface. I don't think that interface is
> C11.

IMHO, one thing worth considering is that for C/C++, the C11/C++11
model is the only memory model that has widespread support. So, even
though it's a fairly weak memory model (unless you go for the "only
seq-cst" beginners' advice) and thus comes with a higher complexity,
this model is what most people will likely become familiar with over
time. Deviating from the "standard" model can have valid reasons, but
it also has a cost, in that new contributors are more likely to be
familiar with the "standard" model than with a custom one.

Note that I won't claim that the C11/C++11 model is perfect -- there are
a few rough edges there (e.g., the forward progress guarantees are (or
used to be) a little coarse for my taste), and consume vs. dependencies
worries me as well. But, IMHO, overall it's the best C/C++ language
model we have.


Torvald

