Re: [RFC] tools/memory-model: Rule out OOTA

From: Jonas Oberhauser
Date: Thu Jan 16 2025 - 14:09:09 EST




On 1/13/2025 at 11:04 PM, Paul E. McKenney wrote:
On Fri, Jan 10, 2025 at 05:21:59PM +0100, Jonas Oberhauser wrote:


On 1/10/2025 at 3:54 PM, Paul E. McKenney wrote:
On Thu, Jan 09, 2025 at 07:35:19PM +0100, Jonas Oberhauser wrote:
On 1/9/2025 at 6:54 PM, Paul E. McKenney wrote:
On Wed, Jan 08, 2025 at 08:17:51PM +0100, Jonas Oberhauser wrote:


On 1/8/2025 at 7:09 PM, Paul E. McKenney wrote:
If I change the two plain assignments to use WRITE_ONCE() as required
by memory-barriers.txt, OOTA is avoided:


I think this direction of inquiry is a bit misleading. There need not be any
speculative store at all:



P0(int *a, int *b, int *x, int *y) {
        int r1;
        int r2 = 0;

        r1 = READ_ONCE(*x);
        smp_rmb();
        if (r1 == 1) {
                r2 = *b;
        }
        WRITE_ONCE(*a, r2);
        smp_wmb();
        WRITE_ONCE(*y, 1);
}

P1(int *a, int *b, int *x, int *y) {
        int r1;
        int r2 = 0;

        r1 = READ_ONCE(*y);
        smp_rmb();
        if (r1 == 1) {
                r2 = *a;
        }
        WRITE_ONCE(*b, r2);
        smp_wmb();
        WRITE_ONCE(*x, 1);
}


If we want to respect something containing a control dependency to a
WRITE_ONCE() not in the body of the "if" statement, we need to make some
change to memory-barriers.txt.

I'm not sure what you denote by *this* in "if we change this", but just
to clarify, I am not claiming that there is a (semantic) control
dependency to WRITE_ONCE(*b, r2) in this example.

There is however a data dependency from r2 = *a to WRITE_ONCE, and I would
say that there is a semantic data (not control) dependency from r1 =
READ_ONCE(*y) to WRITE_ONCE(*b, r2), too: depending on the value read from
*y, the value stored to *b will be different. The latter would be enough to
avoid OOTA according to the mainline LKMM, but currently this semantic
dependency is not detected by herd7.

According to LKMM, address and data dependencies must be headed by
rcu_dereference() or similar. See Documentation/RCU/rcu_dereference.rst.

Therefore, there is nothing to chain the control dependency with.

Note that herd7 does generate dependencies. And speaking informally, there
clearly is a semantic dependency.

Both the original formalization of LKMM and my patch do say that a plain
load at the head of a dependency chain does not provide any dependency
ordering, i.e.,

[Plain & R] ; dep

is never part of hb, both in LKMM and in my patch.

Agreed, LKMM does filter the underlying herd7 dependencies.

By the way, if your concern is the dependency *starting* from the plain
load, then we can look at examples where the dependency starts from a marked
load:


r1 = READ_ONCE(*x);
smp_rmb();
if (r1 == 1) {
        r2 = READ_ONCE(*a);
}
*b = 1;
smp_wmb();
WRITE_ONCE(*y,1);

This is more or less analogous to the addr ; [Plain] ; wmb case you
already have.

This is probably a failure of imagination on my part, but I am not
seeing how to create another thread that interacts with that store to
"b" without resulting in a data race.

The other thread is the same, just with x/y and a/b flipped around.

r1 = READ_ONCE(*y);
smp_rmb();
if (r1 == 1) {
        r2 = READ_ONCE(*b);
}
*a = 1;
smp_wmb();
WRITE_ONCE(*x,1);

There is no data race, because the read from *b can only happen if we also read from the store to *y, and the wmb and rmb 'prevent the reordering' of the store to *b with the store to *y, and of the load from *y with the load from *b.


Ignoring that, I am not seeing much in the way of LKMM dependencies
in that code.

Ah, I meant to write

*b = r2

and

*a = r2

My mistake.
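
With that correction in place, the whole pair would look roughly like the sketch below, in herd7's C litmus format. The test name, the empty init block and the exists clause are my additions; the exists clause just names the execution in which both conditionals fire:

C dep-headed-by-marked-load

{}

P0(int *a, int *b, int *x, int *y) {
        int r1;
        int r2 = 0;

        r1 = READ_ONCE(*x);
        smp_rmb();
        if (r1 == 1) {
                r2 = READ_ONCE(*a);
        }
        *b = r2;
        smp_wmb();
        WRITE_ONCE(*y, 1);
}

P1(int *a, int *b, int *x, int *y) {
        int r1;
        int r2 = 0;

        r1 = READ_ONCE(*y);
        smp_rmb();
        if (r1 == 1) {
                r2 = READ_ONCE(*b);
        }
        *a = r2;
        smp_wmb();
        WRITE_ONCE(*x, 1);
}

exists (0:r1=1 /\ 1:r1=1)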


Note that if we just change it like this:

r1 = READ_ONCE(*x);
smp_rmb();
if (r1 == 1) {
        r2 = READ_ONCE(*a);
}
*r2 = a;
smp_wmb();
WRITE_ONCE(*y,1);

then we have an addr dependency and the OOTA becomes forbidden (unless you know how to initialize *a and *b to valid addresses, you may need to add something like `if (r2 == 0) r2 = a` to run this in herd7).

It still has the same anomaly that triggers the OOTA though, where both threads can read r1==1. This anomaly goes away with my patch.


That is why sequence locks work, after all. In our internal memory model
we have relaxed the definition accordingly, and there are a bunch of
internally used data structures that can only be verified because of
that relaxation.

Again, it is likely best to model sequence locking directly.

You can take that point of view. For our model, like I said, we have taken the other point of view.

The main advantage is that it helps us prove that our sequence lock and a few other speculative algorithms we have developed are correct.

I am currently not at all comfortable with the thought of allowing
plain C-language loads to head any sort of dependency. I really did put
that restriction into both memory-barriers.txt and rcu_dereference.rst
intentionally. There is the old saying "Discipline = freedom", and
therefore compilers' lack of discipline surrounding plain C-language
loads implies a lack of freedom. ;-)

Yes, I understand your concern (or more generally, the concern of letting
plain accesses play a role in ordering).
Obviously, allowing arbitrary plain loads to invoke some kind of ordering
because of a dependency is plain (heh) wrong.
There are two kinds of potential problems:
- the load or its dependent store may not exist in that location at all
- the dependency may not really exist

The second case is a problem with marked accesses as well, and should be
handled by herd7 giving us only actual semantic dependencies (whatever
those are). It cannot be solved in cat. Either way, it is a limitation
that herd7 (and other tools) currently have, and one that we already
live with.

This is easy to say. Please see this paper (which Alan also referred to)
for some of the challenges:

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3064r2.pdf

Mark Batty and his group have identified more gotchas. (See the citation
in the above paper.)

I'm well aware that an absolute definition of semantic dependencies is not easy to give.

The point is that this problem cannot really be solved inside the cat model at all.

So if it is to be solved (instead of merely approximated in ways that can easily be undermined), it needs to be handled by the tooling around the cat model.

Btw, I'm re-reading that paper and here are some comments:
- I agree with the conclusion that C++ should not try to solve OOTA other than by declaring that such executions do not exist
- the definition of OOTA as a cycle in sdep | rfe is not the right one. You really need sdep ; rfe, because you can have multiple consecutive sdep links that together do not form a dependency, e.g.,

int * x[] = { &a, &a };
int i = b;
*x[i] = 1;

here the semantic address dependency from loading b to loading x[i], and the semantic address dependency from loading x[i] to storing to *x[i], do not together form a semantic dependency anymore, because x[i] is always &a and the store therefore always goes to a. So this whole code can just become a = 1, and with mirrored code you can get a = b = 1, which is an sdep | rfe cycle.
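
For concreteness, the mirrored thread I have in mind would be something like this (y[] and j are just names picked for the sketch; after the analogous rewrite it is simply b = 1):

int * y[] = { &b, &b };
int j = a;
*y[j] = 1;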

So compilers can absolutely generate an observable OOTA under your definition of OOTA!

As I explained in the original e-mail, you should instead rely on the semantic dependency definition to cover those sdep ; sdep cases where there is in fact still a dependency, and define OOTA as a cycle in sdep ; rfe.

- I did not like the phrasing that a dependency is a function of one execution, especially in contrast to source code. For a fixed execution, there are no dependencies because all values are fixed. It is the other executions - where you could have read another value - that create the dependencies.

Perhaps it is better to say that it is not a *local* function of source code, i.e., just because the same source code has a dependency in context C does not mean that it has a dependency in context C'.

In fact, I would say a major gotcha of dependencies is that they are not even a function of the set of all permissible (according to the standard) executions.
That is because the compiler does not have to make every permissible execution possible, only at least one.

If the compiler knows that among all executions that it actually makes possible, some value is always 0 - even if there is a permissible execution in which it is not 0 - it can still replace that value with 0. E.g.

T1 { x = 1; x = 0; }
T2 { T[x] = 1; }

~~ merge -- makes executions where T[x] reads x==1 impossible ~>

T1' { x = 1; x = 0; T[x] = 1; }

~~ replace ~>

T1' { x = 1; x = 0; T[0] = 1; /* no more dependency :( */ }

which defeats any semantic dependency definition that is a function of a single execution.

Of course you avoid this by making it really a function of the compiler + program *and* the execution, and looking at all the other possible (not just permissible) executions.

- On the one hand, that definition makes a lot of sense. On the other hand, at least without the atomics=volatile restriction, it would have the downside that a compiler which generates just a single execution of your program can claim that there are no dependencies whatsoever and generate all kinds of "out of thin air" behaviors.
I am not sure whether that is really resolved by the volatile restrictions you impose, but either way those seem far stronger than what one would want.

I would say that the approach with volatile is overzealous because it tries to create a "local" ordering solution to a problem that only requires a "global" ordering solution. Since not every semantic dependency needs to provide order in C++ -- only the cycle of dependencies does -- it is totally ok to add too many semantic dependency edges to a program, even ones that are not going to be exactly maintained by every compiler, as long as we can ensure that, globally, no dependency cycle occurs.

So for example, if we merge x = y || y = x, the merge can turn it into x=y=x or y=x=y (and then into an empty program), but not into a cyclic dependency. So even though one side of the dependency may be violated, for the sake of OOTA we could still label both sides as dependent.

For LKMM the problem is of course much easier, because you have volatiles and compiler barriers. Again, you could maybe add incorrect semantic dependencies between accesses, as long as only the ones that are really preserved end up implying ordering.
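
To illustrate the first point with the earlier merge example (my own sketch): if the accesses are marked, the volatility of READ_ONCE()/WRITE_ONCE() keeps the compiler from pretending that the loaded value is always 0, so the address dependency survives the merge:

T1 { WRITE_ONCE(x, 1); WRITE_ONCE(x, 0); }
T2 { WRITE_ONCE(T[READ_ONCE(x)], 1); }

~~ merge ~>

T1' { WRITE_ONCE(x, 1); WRITE_ONCE(x, 0); WRITE_ONCE(T[READ_ONCE(x)], 1); }
/* the volatile load and stores must all remain, and READ_ONCE(x) cannot be
   replaced by a constant, so the address dependency is preserved */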

So I'm not convinced that you need a compiler-specific definition of dependencies for either of the two cases.

BTW, for what it's worth, Dat3M in a sense uses the clang dependencies: it first lets the compiler do its optimizations, and then verifies the LLVM IR (with a more hardware-like dependency definition).

I think something like that can be a good practical solution with fewer problems than other attempts to approximate the solution.


So the new problem we deal with is to somehow restrict the rule to loads and
dependent stores that the compiler for whatever reason will not fully
eliminate.

This problem too cannot be solved completely inside cat. We can give an
approximation, as discussed with Alan (stating that a store would not be
elided if it is read by another thread, and a read would not be elided if it
reads from another thread and a store that won't be elided depends on it).

This approximation is also limited, e.g., if the addresses of the plain
loads and stores have not yet escaped the function, but at least this
scenario is currently impossible in herd7 (unlike the fake dependency
scenario).

In my mind it would again be better to offload the correct selection of
"compiler-un(re)movable plain loads and stores" to the tools. That may again
not solve the problem fully, but it at least would mean that any changes to
address the imprecision wouldn't need to go through the kernel tree, and
IMHO it is easier to say that the LKMM in the cat files is the model
and that the interpretation of the model has some limitations.

Which we currently do via marked accesses. ;-)

Hey, programmers are not tools : )


We should decide which of these examples should be
added to the github litmus archive, perhaps to illustrate the fact that
plain C-language loads do not head dependency chains. Thoughts?

I'm not sure that is a good idea, given that running the herd7 tool on the
litmus test will clearly show a dependency chain headed by plain loads in
the visual graphs (with `doshow rwdep`).

Maybe it makes more sense to say in the docs that plain loads may head
syntactic or semantic dependency chains, but that, because in the common
case the compiler may cruelly optimize things, LKMM does not guarantee
ordering based on dependency chains headed by plain loads. That would be
consistent with the tooling.

Well, LKMM and herd7 have different notions of what constitutes a
dependency, and the LKMM notion is most relevant here. (And herd7
needs its broader definition so as to handle a wide variety of
memory models.)

I would say that herd7's definition handles a wide variety of hardware memory models, but it has limitations for language-level models. Perhaps it would be good for herd7 to support multiple dependency models as well.


Have fun,
jonas