Re: [RFC] tools/memory-model: Rule out OOTA

From: Jonas Oberhauser
Date: Thu Jan 09 2025 - 15:09:40 EST




Am 1/9/2025 um 8:27 PM schrieb Alan Stern:
On Thu, Jan 09, 2025 at 05:44:54PM +0100, Jonas Oberhauser wrote:


Am 1/9/2025 um 5:17 PM schrieb Alan Stern:
What do you mean by "opaque code in the compiler barrier"? The
compiler_barrier() instruction doesn't generate any code at all; it
merely directs the compiler not to carry any knowledge about values
stored in memory from one side of the barrier to the other.

What I mean by "opaque" is that the compiler does not analyze the code
inside the compiler barrier, so it must treat it as a black box that could
manipulate memory arbitrarily within the limitation that it can not guess
the address of memory.

Okay, I see what you're getting at. The way you express it is a little
confusing, because in fact there is NO code inside the compiler barrier
(although the compiler doesn't know that) -- the barrier() macro expands
to an empty assembler instruction, along with an annotation telling the
compiler that this instruction may affect the contents of memory in
unknown and unpredictable ways.

I am happy to learn a better way to express this.


So for example, in

int a = 1;
barrier();
a = 2;
//...

the compiler does not know how the code inside barrier() accesses memory,
including volatile memory.

I would say rather that the compiler does not know that the values
stored in memory are the same before and after the barrier().
>
Even the
values of local variables whose addresses have not been exported.

No, this is not true. I used to think so too until a short while ago.

But if you look at the output of gcc -o3 you will see that it does happily remove `a` in this function.


But it knows that it can not access `a`, because the address of `a` has
never escaped before the barrier().

> I don't think this is right. barrier is (or can be) a macro, not a
> function call with its own scope. As such, it has -- in principle --
> the ability to export the address of a.

See above. Please test it for yourself, but for your convenience here is the code


extern void foo(int *p);

int opt_a() {
int a;
a = 1;
__asm__ volatile ("" ::: "memory");
a = 2;
}

int opt_b() {
int a;
foo(&a);
a = 1;
__asm__ volatile ("" ::: "memory");
a = 2;
}

and corresponding asm:

opt_a:
bx lr

(the whole function body got deleted!)

opt_b:
push {lr}
sub sp, sp, #12
// calling foo(&a)
add r0, sp, #4
bl foo

// a = 1 -- to make sure a==1 before the barrier()
movs r3, #1
str r3, [sp, #4]

// [empty code from the barrier()]

// a = 2 -- totally deleted

// [return]
add sp, sp, #12
ldr pc, [sp], #4


If you wanted to avoid this, you would need to expose the address of `a` to the asm block using one of its clobber arguments (but I'm not familiar with the syntax to produce a working example on the spot).


Question: Can the compiler assume that no other threads access a between
the two stores, on the grounds that this would be a data race? I'd
guess that it can't make that assumption, but it would be nice to know
for sure.

It can not make the assumption if &a has escaped. In that case, barrier() could be so:

barrier(){
store_release(OTHER_THREAD_PLEASE_MODIFY,&a);

while (! load_acquire(OTHER_THREAD_IS_DONE));
}

with another thread doing

while (! load_acquire(OTHER_THREAD_PLEASE_MODIFY)) yield();
*OTHER_THREAD_PLEASE_MODIFY ++;
store_release(OTHER_THREAD_IS_DONE, 1);



But if we let the address of `a` escape, for example with some external
function foo(int*):

int a;
foo(&a);
a = 1;
barrier();
a = 2;
// ...

Then the compiler has to assume that the code of foo and barrier might be
something like this:

foo(p) { SPECIAL_VARIABLE = p; }
barrier() { TURN_THE_BREAKS_ON = *SPECIAL_VARIABLE; }

I think you're giving the compiler too much credit. The one thing the
compiler is allowed to assume is that the code, as written, does not
contain a data race or other undefined behavior.

Apologies, the way I used "assume" is misleading.
I should have said that the compiler has to ensure that even if the code of foo() and barrier() were so, that the behavior of the code it generates is the same (w.r.t. observable side effects) as if the program were executed by the abstract machine. Or I should have said that it can *not* assume that the functions are *not* as defined above.

Which means that TURN_THE_BREAKS_ON would need to be assigned 1.
The only way the compiler can achieve that guarantee (while treating barrier as a black box) is to make sure that the value of `a` before barrier() is 1.



That is at least my understanding.

This is the sort of question that memory-barriers.txt should answer.
It's closely related to the question I mentioned above.

In fact, even if `a` is unused after a=2, the compiler can only eliminate
`a` in the former case, but in the latter case, still needs to ensure that
the value of `a` before barrier() is 1 (but it can eliminate a=2).

And what if a were a global shared variable instead of a local one? The
compiler is still allowed to do weird optimizations on it, since the
accesses aren't volatile.

I'm not sure if `a` being global is enough for the compiler to consider `a`'s address as having escaped to the inline asm memory, especially if `a` has static lifetime.

If it must be considered escaped, then it can do weird optimizations on it, but not across the barrier().
That is because inside the barrier(), we could be reading the value and store into a volatile field like TURN_THE_BREAKS_ON.
In that case, the value right before the barrier needs to be equal to the value in the abstract machine.


That wasn't true in the C++ context of the paper Paul and I worked on.
Of course, C++ is not our current context here.

Yes, you are completely correct. In C++ (or pure C), where data races are
prevented by compiler/language-builtins rather than with
compiler-barriers/volatiles, all my assumptions break.

That is because the compiler absolutely knows that an atomic_store(&x) does
not access any memory location other than x, so it can do a lot more
"harmful" optimizations.

That's why I said such a language model should just exclude global OOTA by
fiat.

One problem with doing this is that there is no widely agreed-upon
formal definition of OOTA. A cycle in (rwdep ; rfe) isn't the answer
because rwdep does not encapsulate the notion of semantic dependency.

rwdep does not encapsulate any specific notion.
It is herd7 which decides which dependency edges to add to the graphs it generates.
If herd7 would generate edges for semantic dependencies instead of for its version of syntactic dependencies, then rwdep is the answer.

Given that we can not define dep inside the cat model, one may as well define it as rwdep;rfe with the intended meaning of the dependencies being the semantic ones; then it is an inaccuracy of herd7 that it does not provide the proper dependencies.

Anyways LKMM should not care about syntactic dependencies, e.g.


if (READ_ONCE(*a)) {
WRITE_ONCE(*b,1);
} else {
WRITE_ONCE(*b,1);
}

has no semantic dependency and gcc does not guarantee the order between these two accesses, even though herd7 does give us a dependency edge.


I have to read your paper again (I think I read it a few months ago) to
understand if the trivial OOTA would make even that vague axiom unsound
(my intuition says that if the OOTA is never observed by influencing the
side-effect, then forbidding OOTA makes no difference to the set of
"observable behaviors" of a C++ program even there is a trivial OOTA, and if
the OOTA has visible side-effects, then it is acceptable for the compiler
not to do the "optimization" that turns it into a trivial OOTA and choose
some other optimization instead, so we can as well forbid the compiler from
doing it).

If an instance of OOTA is never observed, does it exist?

:) :) :)

In the paper, I speculated that if a physical execution of a program
matches an abstract execution containing such a non-observed OOTA cycle,
then it also matches another abstract execution in which the cycle
doesn't exist. I don't know how to prove this conjecture, though.

Yes, that also makes sense.

Note that this speculation does not hold in the current LKMM though. In the Litmus test I shared in the opening e-mail, where the outcome 0:r1=1 /\ 1:r1=1 is only possible with an OOTA (even though the values from the OOTA are never used anywhere).
With C++'s non-local model I wouldn't be totally surprised if there were similar examples in C++, but given that its ordering definition is a lot more straightforward than LKMM in that it doesn't have all these cases of different barriers like wmb and rmb and corner cases like Noreturn etc., my intuition says that there aren't any.

But I am not going to think deeply about it for the time being.

Best wishes
jonas