Re: [RFC] tools/memory-model: Rule out OOTA

From: Jonas Oberhauser
Date: Fri Jan 10 2025 - 07:22:24 EST




Am 1/10/2025 um 4:12 AM schrieb Alan Stern:
On Thu, Jan 09, 2025 at 09:09:00PM +0100, Jonas Oberhauser wrote:

So for example, in

int a = 1;
barrier();
a = 2;
//...

the compiler does not know how the code inside barrier() accesses memory,
including volatile memory.

I would say rather that the compiler does not know that the values
stored in memory are the same before and after the barrier().

Even the
values of local variables whose addresses have not been exported.

No, this is not true. I used to think so too until a short while ago.

But if you look at the output of gcc -O3, you will see that it happily
removes `a` in this function.

Isn't that consistent with what I said?


Ok, after careful reading, I think there are two assumptions you are making that I believe are not true; but my example is only inconsistent with exactly one of them being false, not with both of them being false:

1) the barrier only tells the compiler that it may *change* the value of memory locations. I believe it also tells the compiler that it may *read* the value of memory locations.
2) the barrier also talks about the values of local variables whose addresses have not been exported. I do not believe this is the case.

The second example I put (where a=1 is still emitted) shows that your assumption 1) is inconsistent with what gcc currently does.

For what gcc guarantees, the manual says: "add memory to the list of clobbered registers. This will cause GCC to not keep memory values cached in registers across the assembler instruction", i.e., it needs to flush the value from the register to actual memory.

I believe this too is inconsistent with your assumption 1): if the barrier only modified memory and did not read it, there would be no need to flush the value to memory. It would suffice to ensure that the value is not assumed to be the same after the barrier.

With assumption 1) discharged, the fact that `a = 1` can still be removed from before the barrier shows that this guarantee does not hold for all memory locations, but only for those that could legally be accessed inside the barrier, i.e., those whose address has been exported.
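To illustrate the distinction, here is a sketch of mine (not from the earlier mails; `local_case`, `escaped_case` and the global `escaped` are hypothetical names). In the first function the compiler may elide `a = 1` entirely because &a never escapes; in the second, it has to emit the store before the barrier:

```c
#define barrier() __asm__ __volatile__("" ::: "memory")

int *escaped;       /* hypothetical global through which &a escapes */

int local_case(void)
{
	int a = 1;  /* address never escapes: gcc -O3 may delete this store */
	barrier();
	a = 2;      /* only this value is observable via the return */
	return a;
}

int escaped_case(void)
{
	int a = 1;
	escaped = &a;   /* address exported: barrier() may legally read *escaped */
	barrier();      /* so a = 1 must be flushed to memory before this point */
	a = 2;
	return a;
}
```

Comparing the -O3 assembly of the two functions should show the store of 1 only in the second one.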



Question: Can the compiler assume that no other threads access a between
the two stores, on the grounds that this would be a data race? I'd
guess that it can't make that assumption, but it would be nice to know
for sure.

It cannot make that assumption if &a has escaped. In that case, barrier()
could look like this:

barrier() {
    store_release(OTHER_THREAD_PLEASE_MODIFY, &a);
    while (!load_acquire(OTHER_THREAD_IS_DONE))
        ;
}

with another thread doing

while (!load_acquire(OTHER_THREAD_PLEASE_MODIFY))
    yield();
(*OTHER_THREAD_PLEASE_MODIFY)++;
store_release(OTHER_THREAD_IS_DONE, 1);

Bear in mind that there's a difference between what a compiler _can do_
and what gcc _currently does_.

Of course. But I am not sure if your comment addresses my comment here, or was intended for another section.
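For concreteness, the handshake above can be written with C11 atomics (a sketch of mine; the two shared variables mirror the pseudocode names, everything else, including `run_handshake` and `helper`, is made up for illustration):

```c
#include <pthread.h>
#include <stdatomic.h>

static _Atomic(int *) other_thread_please_modify;  /* OTHER_THREAD_PLEASE_MODIFY */
static atomic_int other_thread_is_done;            /* OTHER_THREAD_IS_DONE */

static void *helper(void *arg)
{
	int *p;

	/* spin until the "barrier" publishes &a */
	while (!(p = atomic_load_explicit(&other_thread_please_modify,
					  memory_order_acquire)))
		;
	(*p)++;                      /* modify a on the other thread's stack */
	atomic_store_explicit(&other_thread_is_done, 1, memory_order_release);
	return NULL;
}

/* the "barrier" body from the sketch: publish &a, then wait for the helper */
int run_handshake(void)
{
	pthread_t t;
	int a = 1;                   /* &a escapes below */

	pthread_create(&t, NULL, helper, NULL);
	atomic_store_explicit(&other_thread_please_modify, &a,
			      memory_order_release);
	while (!atomic_load_explicit(&other_thread_is_done,
				     memory_order_acquire))
		;
	pthread_join(t, NULL);
	return a;                    /* the helper saw a = 1 and incremented it */
}
```

The release/acquire pairs make the helper's read of a and its increment well-ordered with respect to the main thread's accesses, so the program is race-free even though a lives on the stack.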


I think you're giving the compiler too much credit. The one thing the
compiler is allowed to assume is that the code, as written, does not
contain a data race or other undefined behavior.

Apologies, the way I used "assume" was misleading.
I should have said that the compiler has to ensure that, even if the code of
foo() and barrier() were as above, the behavior of the code it generates is
the same (w.r.t. observable side effects) as if the program were executed by
the abstract machine. Or I should have said that it can *not* assume that
the functions are *not* defined as above.

Which means that TURN_THE_BREAKS_ON would need to be assigned 1.
The only way the compiler can achieve that guarantee (while treating barrier
as a black box) is to make sure that the value of `a` before barrier() is 1.

Who says the compiler has to treat barrier() as a black box? As far as
I know, gcc makes no such guarantee.

Note that if it didn't, then barrier() would not work: since the asm instruction is empty, the compiler could easily figure this out and just delete it.

But maybe I should also be more precise about what I mean by black box, namely that
1) the asm block has significant side effects,
2) these side effects may include reading from and storing to arbitrary (legally accessible) memory locations, in an arbitrary order and under arbitrary control flow.

Both of these imply that the compiler cannot assume that the asm block does not execute logic equivalent to what I put above.




If herd7 would generate edges for semantic dependencies instead of for its
version of syntactic dependencies, then rwdep is the answer.

That statement is meaningless (or at least, impossible to implement)
because there is no widely agreed-upon formal definition for semantic
dependency.

Yes, which also means that a 100% correct end-to-end solution (herd + cat + ... ?) is currently not implementable.

But we can still break the problem into two halves: one which is 100% solved inside the cat file, and one which is the responsibility of herd7 and currently neither solved nor, perhaps, 100% satisfactorily solvable.

The advantage is that if we read the cat file as a mathematical definition, we can at least argue 100% correctly about code on paper, in those cases where we can either work out what the semantic dependencies are, or where we at least say "relative to current compilers, we know which dependencies are semantically preserved", even if herd7 or other tools lag behind in one or both respects.

After all, herd7 is just a (useful) automation tool for reasoning about LKMM, which has its limitations (scalability, a specific definition of dependencies, limited C subset...).
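For reference, the classic shape that a semantic-dependency-based dep edge is meant to rule out is the load-buffering OOTA cycle below (the textbook example, not the litmus test from my opening e-mail). herd7's syntactic dependencies already forbid the listed outcome; the open question is precisely whether compilers preserve such dependencies semantically:

```
C lb-oota

{}

P0(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*x);
	WRITE_ONCE(*y, r1);
}

P1(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*y);
	WRITE_ONCE(*x, r1);
}

exists (0:r1=42 /\ 1:r1=42)
```

Here the value 42 appears out of thin air: each store merely forwards what the other store produced.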


Given that we can not define dep inside the cat model, one may as well
define it as rwdep;rfe with the intended meaning of the dependencies being
the semantic ones; then it is an inaccuracy of herd7 that it does not
provide the proper dependencies.

Perhaps so. LKMM does include other features which the compiler can
defeat if the programmer isn't sufficiently careful.

How many of these are due to herd7's limitations, and how many to the cat file itself?

In the paper, I speculated that if a physical execution of a program
matches an abstract execution containing such a non-observed OOTA cycle,
then it also matches another abstract execution in which the cycle
doesn't exist. I don't know how to prove this conjecture, though.

Yes, that also makes sense.

Note that this speculation does not hold in the current LKMM, though. In the
litmus test I shared in the opening e-mail, the outcome 0:r1=1 /\ 1:r1=1 is
only possible with an OOTA cycle, even though the values from the OOTA are
never used anywhere.

If the fact that the outcome 0:r1=1 /\ 1:r1=1 has occurred is proof that
there was OOTA, then the OOTA cycle _is_ observed, albeit indirectly --
at least, in the sense that I intended. (The situation mentioned in the
paper is better described as an execution where the compiler has elided
all the accesses in the OOTA cycle.)

I'm not sure that sense makes a lot of sense to me.
But it does make the proof of your claim totally trivial: if there is no OOTA-free execution with the same observable behavior, then that behavior is proof that the OOTA happened, so the OOTA was observed.
By contraposition, any non-observed OOTA has an OOTA-free execution with the same observable behavior.

The sense in which I would define observed is more along the lines of "there is an observable side effect (such as store to volatile location) which has a semantic dependency on a load that reads from one of the stores in the OOTA cycle".





Given C++'s non-local model, I wouldn't be totally surprised if there were
similar examples in C++. But since its ordering definition is a lot more
straightforward than LKMM's (it has none of the different barriers like wmb
and rmb, nor corner cases like Noreturn, etc.), my intuition says that there
aren't any.

I'll have to give this some thought.

If you do want to prove the claim for the stricter definition of observed OOTA, I would start by looking at the first read r in each thread that reads from an OOTA store w, and checking whether at least one of them could read from another store.

If that is not possible, then my intuition would be that there would be some happens-before relation blocking you from reading from an earlier store than w, in particular, w ->hb r.

If it is not possible on any thread, then you get a bunch of these hb edges running in parallel to the OOTA cycle.

Perhaps you can turn that into an hb cycle.

This last step doesn't work in LKMM, because the hb may be caused by rmb/wmb, which does not extend over the plain accesses in the OOTA cycle to the bounding store, and so does not yield a bigger hb cycle.

But in C++, if you have w ->hb r on each OOTA rfe edge, then you also have r ->hb w' along the OOTA dep edge, and you get an hb cycle.
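Spelling out that composition for a two-edge cycle (a sketch; ->hb, ->rfe, ->dep as above):

```
(* assumption:  w  ->hb  r   for each rfe edge in the OOTA cycle        *)
(* C++:         r  ->hb  w'  for each dep edge, since the dependency is
                             intra-thread and hence sequenced-before    *)
(* cycle:       w0 ->rfe r0 ->dep w1 ->rfe r1 ->dep w0                  *)
(* composing:   w0 ->hb  r0 ->hb  w1 ->hb  r1 ->hb  w0,
                an hb cycle, which C++ forbids                          *)
```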

Best wishes,
jonas