Re: [RFC] tools/memory-model: Rule out OOTA
From: Alan Stern
Date: Fri Jan 10 2025 - 16:52:05 EST
On Fri, Jan 10, 2025 at 01:21:32PM +0100, Jonas Oberhauser wrote:
>
>
> Am 1/10/2025 um 4:12 AM schrieb Alan Stern:
> > On Thu, Jan 09, 2025 at 09:09:00PM +0100, Jonas Oberhauser wrote:
> >
> > > > > So for example, in
> > > > >
> > > > > int a = 1;
> > > > > barrier();
> > > > > a = 2;
> > > > > //...
> > > > >
> > > > > the compiler does not know how the code inside barrier() accesses memory,
> > > > > including volatile memory.
> > > >
> > > > I would say rather that the compiler does not know that the values
> > > > stored in memory are the same before and after the barrier().
> > > >
> > > > Even the
> > > > values of local variables whose addresses have not been exported.
> > >
> > > No, this is not true. I used to think so too until a short while ago.
> > >
> > > But if you look at the output of gcc -o3 you will see that it does happily
> > > remove `a` in this function.
> >
> > Isn't that consistent with what I said?
>
>
> Ok, after careful reading, I think there are two assumptions you have that I
> think are not true, but my example is only inconsistent with exactly one of
> them being not true, not with both of them being not true:
>
> 1) the barrier only tells the compiler that it may *change* the value of
> memory locations. I believe it also tells the compiler that it may *read*
> the value of memory locations.
> 2) the barrier also talks about the values of local variables whose
> addresses have not been exported. I do not believe this is the case.
I checked the GCC manual. You are right about 1); the compiler is
required to guarantee that the contents of memory before the barrier are
fully up-to-date (no dirty values remaining in registers or
temporaries).
2) isn't so clear. If a local variable's address is never computed then
the compiler might put the variable in a register, in which case the
barrier would not clobber it. On the other hand, if the variable's
address is computed somewhere (even if it isn't exported) then the
variable can't be kept in a register and so it is subject to the
barrier's effects.
The manual says:
Using the ‘"memory"’ clobber effectively forms a read/write
memory barrier for the compiler.
Not a word about what happens if a variable has the "register" storage
class, for example, and so might never be stored in memory at all. But
this leaves the programmer with _no_ way to specify a memory barrier for
such variables!
Of course, the fact that these variables cannot be exposed to outside
code does mitigate the problem...
> For what gcc guarantees, the manual says: "add memory to the list of
> clobbered registers. This will cause GCC to not keep memory values cached in
> registers across the assembler instruction", i.e., it needs to flush the
> value from the register to actual memory.
Yes, GCC will write back dirty values from registers. But not because
the cached values will become invalid (in fact, the cached values might
not even be used after the barrier). Rather, because the compiler is
required to assume that the assembler code in the barrier might access
arbitrary memory locations -- even if that code is empty.
> > > > Question: Can the compiler assume that no other threads access a between
> > > > the two stores, on the grounds that this would be a data race? I'd
> > > > guess that it can't make that assumption, but it would be nice to know
> > > > for sure.
> > >
> > > It can not make the assumption if &a has escaped. In that case, barrier()
> > > could be so:
> > >
> > > barrier(){
> > > store_release(OTHER_THREAD_PLEASE_MODIFY,&a);
> > >
> > > while (! load_acquire(OTHER_THREAD_IS_DONE));
> > > }
> > >
> > > with another thread doing
> > >
> > > while (! load_acquire(OTHER_THREAD_PLEASE_MODIFY)) yield();
> > > *OTHER_THREAD_PLEASE_MODIFY ++;
> > > store_release(OTHER_THREAD_IS_DONE, 1);
Okay, yes, the compiler can't know whether the assembler code will do
this. But as far as I know, there is no specification about whether
inline assembler can synchronize with code in another thread (in the
sense used by the C++ memory model) and therefore create a
happens-before link. Language specifications tend to ignore issues
like inline assembler.
Does this give the compiler license to believe no such link can exist
and therefore accesses to these non-atomic variables by another thread
concurrent with the barrier would be data races?
In the end maybe this doesn't matter.
> > > If herd7 would generate edges for semantic dependencies instead of for its
> > > version of syntactic dependencies, then rwdep is the answer.
> >
> > That statement is meaningless (or at least, impossible to implement)
> > because there is no widely agreed-upon formal definition for semantic
> > dependency.
>
> Yes, which also means that a 100% correct end-to-end solution (herd + cat +
> ... ?) is currently not implementable.
>
> But we can still break the problem into two halves, one which is 100% solved
> inside the cat file, and one which is the responsibility of herd7 and
> currently not solved (or 100% satisfactorily solvable).
I believe that Luc and the other people involved with herd7 take the
opposite point of view: herd7 is intended to do the "easy" analysis
involving only straightforward code parsing, leaving the "hard"
conceptual parts to the user-supplied .cat file.
> The advantage being that if we read the cat file as a mathematical
> definition, we can at least on paper argue 100% correctly about code for the
> cases where we either can figure out on paper what the semantic dependencies
> are, or where we at least just say "with relation to current compilers, we
> know what the semantically preserved dependencies are", even if herd7 or
> other tools lags behind in one or both.
>
> After all, herd7 is just a (useful) automation tool for reasoning about
> LKMM, which has its limitations (scalability, a specific definition of
> dependencies, limited C subset...).
I still think we should not attempt any sort of formalization of
semantic dependency here.
> > > Given that we can not define dep inside the cat model, one may as well
> > > define it as rwdep;rfe with the intended meaning of the dependencies being
> > > the semantic ones; then it is an inaccuracy of herd7 that it does not
> > > provide the proper dependencies.
> >
> > Perhaps so. LKMM does include other features which the compiler can
> > defeat if the programmer isn't sufficiently careful.
>
> How many of these are due to herd7's limitations vs. in the cat file?
Important limitations are present in both.
> > > > In the paper, I speculated that if a physical execution of a program
> > > > matches an abstract execution containing such a non-observed OOTA cycle,
> > > > then it also matches another abstract execution in which the cycle
> > > > doesn't exist. I don't know how to prove this conjecture, though.
> > >
> > > Yes, that also makes sense.
> > >
> > > Note that this speculation does not hold in the current LKMM though. In the
> > > Litmus test I shared in the opening e-mail, where the outcome 0:r1=1 /\
> > > 1:r1=1 is only possible with an OOTA (even though the values from the OOTA
> > > are never used anywhere).
> >
> > If the fact that the outcome 0:r1=1 /\ 1:r1=1 has occurred is proof that
> > there was OOTA, then the OOTA cycle _is_ observed, albeit indirectly --
> > at least, in the sense that I intended. (The situation mentioned in the
> > paper is better described as an execution where the compiler has elided
> > all the accesses in the OOTA cycle.)
>
> I'm not sure that sense makes a lot of sense to me.
Here's an example illustrating what I had in mind. Imagine that all the
accesses here are C++-style relaxed atomic (i.e., not volatile and also
not subject to data races):
P0(int *a, int *b) {
int r0 = *a;
*b = r0;
*b = 2;
// r0 not used again
}
P1(int *a, int *b) {
int r1 = *b;
*a = r1;
*a = 2;
// r1 not used again
}
The compiler could eliminate the r0 and r1 accesses entirely, leaving
just:
P0(int *a, int *b) {
*b = 2;
}
P1(int *a, int *b) {
*a = 2;
}
An execution of the corresponding machine code would then be compatible
with an abstract execution of the source code in which both r0 and r1
get set to 42 (OOTA). But it would also be compatible with an abstract
execution in which both r0 and r1 are 0, so it doesn't make sense to say
that the hardware execution is, or might be, an instance of OOTA.
> But it does make the proof of your claim totally trivial. If there is no
> other OOTA-free execution with the same observable behavior, then it is
> proof that the OOTA happened, so the OOTA was observed.
> So by contraposition any non-observed OOTA has an OOTA-free execution with
> the same observable behavior.
>
> The sense in which I would define observed is more along the lines of "there
> is an observable side effect (such as store to volatile location) which has
> a semantic dependency on a load that reads from one of the stores in the
> OOTA cycle".
Yes, I can see that from your proposed definition of Live.
I'm afraid we've wandered off the point of this email thread, however...
Getting back to the original point, why don't you rewrite your patch as
discussed earlier and describe it as an attempt to add ordering for
important situations involving plain accesses that the LKMM currently
does not handle? In other words, leave out as far as possible any
mention of OOTA or semantic dependency.
Alan