Re: [RFC] tools/memory-model: Rule out OOTA
From: Jonas Oberhauser
Date: Sun Jan 12 2025 - 10:57:46 EST
Am 1/11/2025 um 10:19 PM schrieb Alan Stern:
On Sat, Jan 11, 2025 at 01:46:21PM +0100, Jonas Oberhauser wrote:
>>
This is under my assumption that if we had, let's say, gcc's "semantic
dependencies" or an under-approximation of them (by that I mean allowing
fewer things to be dependent than gcc can see), then these cases would be
resolved, in the sense that gcc cannot undermine [R & Marked] ; gcc-dep,
where gcc-dep is the set of dependencies detected by gcc.
That seems circular. Basically, you're saying that gcc will not break
any dependencies that gcc classifies as not-breakable!
Maybe my formulation is not exactly what I meant to express.
I am thinking of examples like this,
r1 = READ_ONCE(*x);
*b = r1;

~~>

r1 = READ_ONCE(*x);
if (*b != r1) {
        *b = r1;
}
Here there is clearly a dependency from the load to the store, but gcc
might turn it into an independent load (in the case *b == r1, no store
is performed at all).
Just because gcc admits that there is a dependency does not necessarily
mean that it will not still undermine the ordering "bestowed upon" that
dependency by a memory model in some creative way.
The cases that I could think of all still worked, but for very specific
architecture-specific reasons (e.g., x86 has CMOV, but all its loads
provide acquire ordering, and arm does not have a flag-conditional str,
etc.).
Or perhaps there is no dependency in case *b == r1. I am not sure.
Another thought that pops up here is that when I last worked on
formalizing dependencies, I could not define dependencies as being
between one load and one store, a dependency might be between a set of
loads and one store. I would have to look up the exact reason, but I
think it was because sometimes you need to change more than one value to
influence the result, e.g., a && b where both a and b are 0 - just
changing one will not make a difference.
All of these complications make me wonder whether even a relational
notion of semantic dependency is good enough.
Alan Stern:
Perhaps so. LKMM does include other features which the compiler can
defeat if the programmer isn't sufficiently careful.
How many of these are due to herd7's limitations vs. the .cat file?
Important limitations are present in both.
I am genuinely asking. Do we have a list of the limitations?
Maybe it would be good to collect it in the "A WARNING" section of
explanation.txt if it doesn't exist elsewhere.
There are a few listed already at various spots in explanation.txt --
search for "undermine". And yes, many or most of these limitations do
arise from LKMM's failure to recognize when a dependency isn't semantic.
Maybe some are also related to undefined behavior, which LKMM is not
aware of.
Thanks. Another point mentioned in that document is that po is a total
order, whereas C has the more relaxed notion of sequenced-before; this
can even affect volatile accesses, e.g. in f(READ_ONCE(*a), g()), where
g() calls an rmb and then READ_ONCE(*b): it is not clear whether there
should be a ppo edge from the read of *a to the read of *b in some
executions or not.
There is one other weakness I know of, however -- something totally
different. It's an instance in which the formal model in the .cat file
fails to capture the intent of the informal operational model.
As I recall, it goes like this: The operational model asserts that an
A-cumulative fence (like a release fence) on a CPU ensures ordering for
all stores that propagate to that CPU before the fence is executed. But
the .cat code says only that this ordering applies to stores which the
CPU reads from before the fence is executed. I believe you can make up
litmus tests where you can prove that a store must have propagated to
the CPU before an A-cumulative fence occurs, even though the CPU doesn't
read from that store; for such examples the LKMM may accept executions
that shouldn't be allowed.
There may even be an instance of this somewhere in the various litmus
test archives; I don't remember.
Ah, I think you explained this case to me before. It is the one where
prop & int covers some but not all cases, right? (compared to the cat
PowerPC model, which does not even have prop & int)
For now I am mostly worried about cases where LKMM promises too much,
rather than too little. The latter case arises naturally as a trade-off
between complexity of the model, algorithmic complexity, and what
guarantees are actually needed from the model by real code.
By the way, I am currently investigating a formulation of LKMM where
there is a separate "propagation order" per thread, prop-order[t], which
relates event `a` to event `b` iff `a` is "observed" to propagate to t before
`b`. Observation can also include co and fr, not just rf, which might be
sufficient to cover those cases. I have a hand-written proof sketch that
an order ORD induced by prop-order[t], xb, and some other orders is
acyclic, and a Coq proof that executing an operational model in any
linearization of ORD is permitted (e.g., does not propagate a store
before it is executed, or in violation of co) and has the same rf as the
axiomatic execution.
So if the proof sketch works out, this might indicate that with such a
per-thread propagation order, one can eliminate those cases.
But to make the definition work, I had to make xb use prop-order[t]
instead of prop in some cases, and the definitions of xb and
prop-order[t] are mutually recursive, so it's not a very digestible
definition.
So I do not recommend replacing LKMM with such a model even if it works,
but it is useful for investigating some boundary conditions.
An execution of the corresponding machine code would then be compatible
with an abstract execution of the source code in which both r0 and r1
get set to 42 (OOTA). But it would also be compatible with an abstract
execution in which both r0 and r1 are 0, so it doesn't make sense to say
that the hardware execution is, or might be, an instance of OOTA.
Yes, but this does not require the definition you expressed before. It is
already not observable according to the definition that there is no read R
from an OOTA-cycle store such that some observable side effect semantically
depends on R.
>>
What I was trying to say is that your definition is almost a
no-true-scotsman (sorry Paul) definition:
(I would use the term "tautology" -- although I don't believe it was a
tautology in its original context, which referred specifically to cases
where the compiler had removed all the accesses in an OOTA cycle.)
every program with an OOTA
execution for which you could potentially not find an "equivalent"
execution without OOTA is simply labeled a "no-true-unobserved OOTA".
But it does make the proof of your claim totally trivial. If there is no
OOTA-free execution with the same observable behavior, then that is
proof that the OOTA happened, so the OOTA was observed.
So by contraposition any non-observed OOTA has an OOTA-free execution with
the same observable behavior.
The sense in which I would define observed is more along the lines of "there
is an observable side effect (such as store to volatile location) which has
a semantic dependency on a load that reads from one of the stores in the
OOTA cycle".
Agreed. But maybe this indicates that your sense is too weak a
criterion, that an indirect observation should count just as much as a
direct one.
I don't follow this conclusion.
I think there are two relevant claims here:
1) compilers do not introduce "observed OOTA"
2) For every execution graph with a non-observed OOTA, there is another
execution with the same observable side effects that does not have OOTA.
While it may be much easier to prove 2) with a more relaxed notion of
observed OOTA, 1) sounds much harder. How does the compiler know that
there is no indirect way to observe the OOTA?
E.g., in the LKMM example, ignoring the compiler barriers, it might be
possible for a compiler to deduce that the plain accesses are never used
and delete them, resulting in an OOTA that is observed under the more
relaxed setting, violating claim 1).
Best wishes,
jonas