Re: [RFC] tools/memory-model: Rule out OOTA
From: Jonas Oberhauser
Date: Wed Jan 08 2025 - 12:33:50 EST
Am 1/7/2025 um 5:09 PM schrieb Alan Stern:
On Mon, Jan 06, 2025 at 10:40:03PM +0100, Jonas Oberhauser wrote:
The current LKMM allows out-of-thin-air (OOTA), as evidenced in the following
example shared on this list a few years ago:
P0(int *a, int *b, int *x, int *y) {
int r1;
r1 = READ_ONCE(*x);
smp_rmb();
if (r1 == 1) {
*a = *b;
}
smp_wmb();
WRITE_ONCE(*y, 1);
}
P1(int *a, int *b, int *x, int *y) {
int r1;
r1 = READ_ONCE(*y);
smp_rmb();
if (r1 == 1) {
*b = *a;
}
smp_wmb();
WRITE_ONCE(*x, 1);
}
exists b=42
The root cause is an interplay between plain accesses and rw-fences, i.e.,
smp_rmb() and smp_wmb(): while smp_rmb() and smp_wmb() provide sufficient
ordering for plain accesses to rule out data races, they do not in the current
formalization generally actually order the plain accesses, allowing, e.g., the
load and stores to *b to proceed in any order even if P1 reads from P0; and in
particular, the marked accesses around those plain accesses are also not
ordered, which causes this OOTA.
That's right. The memory model deliberately tries to avoid placing
restrictions on plain accesses, whenever it can.
In the example above, for instance, I think it's more interesting to ask
exists 0:r1=1 /\ 1:r1=1
than to concentrate on a and b.
Yes, and of course there's a relationship between the two anomalies.
My proposed patch solves OOTA indirectly, by forbidding the marked
accesses to *x and *y from being reordered in case r1 == 1, and thereby,
forbidding exactly the outcome where both have r1 == 1.
OOTA is a very difficult subject.
No doubt.
It can be approached only by making
the memory model take all sorts of compiler optimizations into account,
and doing this for all possible optimizations is not feasible.
I think there's two parts of this, one is the correct definition of
semantic (or compiler-preserved) dependencies. This is a really hard
problem that can indeed only be solved by looking at all ways the
compiler can optimize things. I'm not trying to solve that (and can't
address that issue in a cat file anyways).
The second part is to see which accesses could participate in an OOTA.
I think this is a lot simpler.
(For example, in a presentation to the C++ working group last year, Paul
and I didn't try to show how to extend the C++ memory model to exclude
OOTA [other than by fiat, as it does now]. Instead we argued that with
the existing memory model, no reasonable compiler would ever produce an
executable that could exhibit OOTA and so the memory model didn't need
to be changed.)
I think a model like C++ should exclude it exactly by fiat, by
formalizing OOTA-freedom as an axiom, like
acyclic ( (data|ctrl|addr) ; rfe )
This axiom is to some extent unsatisfying because it does not explain
*how* the OOTA is avoided, and in particular, it does not forbid any
specific kind of behavior that would lead to OOTA, just the complete
combination of them.
So for example, the following would be allowed:
P0:
r0 = y.load_explicit(memory_order_relaxed);
x.store_explicit(r0,memory_order_relaxed);
P1:
r1 = x; // reads 1
atomic_thread_fence();
y = 1;
which could by some people be interpreted as the accesses of P0 being
executed out of order (although there is no such concept in C++),
indicating that if P0's accesses are allowed to be executed out of
order, then so should P2's:
P2:
r2 = x.load_explicit(memory_order_relaxed);
y.store_explicit(r2,memory_order_relaxed);
Of course if both P0 and P2 are "executing out of order" at the same
time, one would have OOTA, but this "global behavior" would be forbidden
by the axiom.
But that is already how C++ works: a release access in C++ does not have
any ordering by itself either. It is just the combination of release +
acquire on other threads that somehow establishes synchronization.
LKMM works differently though, by providing "local" ordering rules,
through ppo. We can argue about a ppo locally even without knowing the
code of any other threads, let alone whether their accesses have acquire
or release semantics. So we can address OOTA in a "local" manner as well.
And it has the advantage of having compiler barriers around a lot of
things, which makes reasoning a lot more feasible.
In this patch, we choose the rather conservative approach of forcing only the
order of these marked accesses, specifically, when a marked read r is
separated from a plain read r' by an smp_rmb() (or r' has an address
dependency on r or is r'), on which a write w' depends, and w' is either plain
and seperated by a subsequent write w by an smp_wmb() (or w' is w), then r
precedes w in ppo.
Is this really valid? In the example above, if there were no other
references to a or b in the rest of the program, the compiler could
eliminate them entirely.
In the example above, this is not possible, because the address of a/b
have escaped the function and are not deallocated before synchronization
happens.
Therefore the compiler must assume that a/b are accessed inside the
compiler barrier.
(Whether the result could count as OOTA is
open to question, but that's not the point.) Is it not possible that a
compiler might find other ways to defeat your intentions?
The main point (which I already mentioned in the previous e-mail) is if
the object is deallocated without synchronization (or never escapes the
function in the first place).
And indeed, any such case renders the added rule unsound. It is in a
sense unrelated to OOTA; cases where the load/store can be elided are
never OOTA.
Of course that is outside the current scope of what herd7 needs to deal
with or can express, because deallocation isn't a thing in herd7.
Nevertheless, trying to specify inside cat when an access is "live" --
relevant enough that the compiler will keep it around -- is hard and
tedious (it's the main source of complication in the patch).
A much better way would be to add a base set of "live loads and stores"
Live, which are the loads and stores that the compiler must consider to
be live. Just like addr, ctrl, etc., we don't have to specify these in
cat, and rather rely on herd7 to correctly .
If an access interacts with an access of another thread (by reading from
it or being read from it), it must be live.
Then we could formulate the rule as
+let to-w = (overwrite & int) | (addr | rmb ; [Live])? ; rwdep ; ([Live]
; wmb)?
(the latter case being a generalization of the current `addr ; [Plain] ;
wmb` and `rwdep` cases of to-w, assuming we restrict it Life accesses -
it is otherwise also unsound:
int a[2] = {0};
int r1 = READ_ONCE(*x);
a[r1] = 0; // compiler will just remove this
smp_wmb();
WRITE_ONCE(*y, 1);
)
The formulation in the patch is just based on a complicated and close
but imperfect approximation of Live.
In any case, my feeling is that memory models for higher languages
(i.e., anything above the assembler level) should not try very hard to
address the question of OOTA. And for LKMM, OOTA involving _plain_
accesses is doubly out of bounds.
Your proposed change seems to add a significant complication to the
memory model for a not-very-clear benefit.
Even if we ignore OOTA for the moment, it is not inconceivable to have
some cases use a combination of plain accesses with
dependencies/rmb/wmb, such as in an RCU case.
That's probably the reason for the current addr ; [Plain] ; wmb exists.
It's not clear that it covers all cases though.
Best wishes,
jonas