Re: [RFC] tools/memory-model: Rule out OOTA

From: Jonas Oberhauser
Date: Wed Jan 08 2025 - 14:22:36 EST




On 1/8/2025 at 7:47 PM, Alan Stern wrote:
On Wed, Jan 08, 2025 at 06:33:05PM +0100, Jonas Oberhauser wrote:


On 1/7/2025 at 5:09 PM, Alan Stern wrote:
Is this really valid? In the example above, if there were no other
references to a or b in the rest of the program, the compiler could
eliminate them entirely.

In the example above, this is not possible, because the addresses of a/b have escaped the function and a/b are not deallocated before synchronization happens. Therefore the compiler must assume that a/b are accessed inside the compiler barrier.

I'm not quite sure what you mean by that, but if the compiler has access
to the entire program and can do a global analysis then it can recognize
cases where accesses that may seem to be live aren't really.

Even in cases trivial enough that the compiler has access to all of the source, compiler barriers should be opaque to the compiler.

Since the barrier is opaque,

*a = 1;
compiler_barrier();

might as well be

*a = 1;
*d = *a; // *d is in device memory

and so in my opinion the compiler needs to ensure that the value of *a right before the compiler barrier is 1.

Of course, this holds only if the address of *a could legally be known to the opaque code in the compiler barrier.
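
For concreteness (assuming compiler_barrier() here stands for the kernel's barrier() macro), such a barrier is essentially an empty asm statement with a memory clobber:

#define barrier() __asm__ __volatile__("" : : : "memory")

The memory clobber tells the compiler that the asm body may read or write any memory it could legitimately reach, which is exactly the kind of opaque code that might read *a above.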



However, I admit this objection doesn't really apply to Linux kernel
programming.

(Whether the result could count as OOTA is
open to question, but that's not the point.) Is it not possible that a
compiler might find other ways to defeat your intentions?

The main point (which I already mentioned in the previous e-mail) is the case where the object is deallocated without synchronization (or never escapes the function in the first place).

And indeed, any such case renders the added rule unsound. It is in a sense
unrelated to OOTA; cases where the load/store can be elided are never OOTA.

That is a matter of definition. In our paper, Paul and I described
instances of OOTA in which all the accesses have been optimized away as
"trivial".

Yes, by OOTA I mean a rwdep;rfe cycle.

In the absence of data races, such a cycle can't be optimized away because it is created with volatile/compiler-barrier-protected accesses.
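
As a minimal sketch of such a cycle (a standard LB-style litmus test; the name and the value 42 are only for illustration):

C LB+data+data

{}

P0(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*x);
	WRITE_ONCE(*y, r1);
}

P1(int *x, int *y)
{
	int r2;

	r2 = READ_ONCE(*y);
	WRITE_ONCE(*x, r2);
}

exists (0:r1=42 /\ 1:r2=42)

Each store depends on a load (rwdep) that reads from the other thread's store (rfe), so the postcondition could only be satisfied by completing the rwdep;rfe cycle; and since all four accesses are READ_ONCE/WRITE_ONCE, none of them can be elided.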


If an access interacts with an access of another thread (by reading from it
or being read from it), it must be live.

This is the sort of approximation I'm a little uncomfortable with. It
would be better to say that a store which is read from by a live load
must be live. I don't see why a load which reads from a live store has
to be live.

You are right, and I was careless.
All we need is that a store which is read externally by a live load is live, and that a load which reads from an external store and whose value is semantically depended on by a live store is live.


The formulation in the patch is just based on a complicated and close but
imperfect approximation of Live.

Maybe you can reformulate the patch to make this more explicit.


It would look something like this:

Live = R & rng(po \ po ; [W] ; (po-loc \ w_barrier)) | W & dom(po \ ((po-loc \ w_barrier) ; [W] ; po))

let to-w = (overwrite & int) | (addr | rmb ; [Live])? ; rwdep ; ([Live] ; wmb)?


In any case, it seems that any approximation we can make to Live will be
subject to various sorts of errors.

Probably (this is certainly true for trying to approximate dependencies, for example), but what I know for certain is that the approximations of Live inside cat get uglier the more precise they become. In the above definition of Live I have not included that the address must escape, nor that the object must not be freed.
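
For instance, in the never-escapes case no barrier helps (a minimal sketch; foo() and the local x are purely illustrative):

void foo(void)
{
	int x;	/* address never taken, so it never escapes */
	int r;

	x = 1;
	barrier();
	r = x;	/* the compiler may still fold this to r = 1 */
}

Since no other thread can legally access x, the compiler is free to keep it in a register across the barrier, and both accesses can be elided.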

A non-local definition that suffices for OOTA would be as follows:

Live = R & rng(rfe) & dom(rwdep ; rfe) | W & dom(rfe)
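
Purely as a sketch of how such a definition might be used (this is not the actual patch; oota-order and no-oota are names made up for illustration), the OOTA-forbidding check could then be restricted to Live accesses:

let oota-order = [Live] ; rwdep ; [Live] ; rfe ; [Live]
acyclic oota-order as no-oota

With the definition above, every access on a rwdep;rfe cycle satisfies the Live conditions, so genuine OOTA cycles remain forbidden while accesses that can be elided stay unconstrained.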


It seems the ideal solution is to let Live be defined by the tools, which should keep up with or exceed the analysis done by state-of-the-art compilers.

Best wishes,
jonas