On Tue, Jan 31, 2023 at 02:56:00PM +0100, Jonas Oberhauser wrote:
> I have some additional thoughts now. It seems that you could weaken the
> operational model by stating that an A-cumulative fence orders propagation
> of all *external* stores (in addition to all po-earlier stores) that
> propagated to you before the fence is executed.

How is that a weakening of the operational model? It's what the
operational model says right now.

	For each other CPU C', any store which propagates to C before
	a release fence is executed (including all po-earlier
	stores executed on C) is forced to propagate to C' before the
	store associated with the release fence does.

In theory, we could weaken the operational model by saying that pfences
order propagation of stores from other CPUs only when those stores are
read-from by instructions po-before the fence. But I suspect that's not
such a good idea.

> It seems that on power, from an operational model perspective, there's
> currently no difference between propagation fences ordering all stores vs
> only external stores that propagated to the CPU before the fence is
> executed, because they only have bidirectional (*->W) fences (sync, lwsync)
> and not uni-directional (acquire, release), and so it is not possible for a
> store that is po-later than the barrier to be executed before the barrier;
> i.e., on power, every internal store that propagates to a CPU before the
> fence executes is also po-earlier than the fence.
>
> If power did introduce release stores, I think you could potentially create
> implementations that allow the behavior in the example you have given, but I
> don't think they are the most natural ones:

Maybe so. In any case, it's a moot point. In fact, I don't know if any
architecture supporting Linux allows a write that is po-after a release
store to be reordered before the release store.

> > P0(int *x, int *y, int *z)
> > {
> > 	int r1;
> >
> > 	r1 = READ_ONCE(*x);
> > 	smp_store_release(y, 1);
> > 	WRITE_ONCE(*z, 1);
> > }
> >
> > P1(int *x, int *y, int *z)
> > {
> > 	int r2;
> >
> > 	r2 = READ_ONCE(*z);
> > 	WRITE_ONCE(*x, r2);
> > }
> >
> > P2(int *x, int *y, int *z)
> > {
> > 	int r3;
> > 	int r4;
> >
> > 	r3 = READ_ONCE(*y);
> > 	smp_rmb();
> > 	r4 = READ_ONCE(*z);
> > }
> >
> > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
>
> I could imagine that P0 posts both of its stores in a shared store buffer
> before reading *x, but marks the release store as "not ready".
> Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads,
> and subsequently marks its release store as "ready".

That isn't how release stores are meant to work. The read of x is
supposed to be complete before the release store becomes visible to any
other CPU. This is true even in C11.

> Then the release store is sent to the cache, where P2 reads *y=1 and then
> *z=0.
> Finally P0 sends its *z=1 store to the cache.
>
> However, a perhaps more natural implementation would not post the release
> store to the store buffer until it is "ready", in which case the order in
> the store buffer would be *z=1 before *y=1, and in this case the release
> ordering would presumably work like your current operational model.
>
> Nevertheless, perhaps this slightly weaker operational model isn't as absurd
> as it sounds. And I think many people wouldn't be shocked if the release
> store didn't provide ordering with *z=1.

This issue is one we should discuss with all the other people involved
in maintaining the LKMM.
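
For anyone who wants to step through the store-buffer scenario described
above, here is a minimal, deterministic sketch in C. It is only a toy
model under stated assumptions, not a description of any real hardware
or of the LKMM: P0 and P1 share one store buffer, P2 reads only the
cache, and the interleaving is hard-coded to follow the steps in the
description. All names (sb_entry, post, load_forwarding, drain,
run_scenario) are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: one shared store buffer in front of a single "cache". */
enum { X, Y, Z, NVARS };
enum { SB_CAP = 8 };

struct sb_entry {
	int var;
	int val;
	bool ready;	/* a "not ready" entry may not be forwarded or drained */
	bool drained;
};

struct machine {
	int cache[NVARS];
	struct sb_entry sb[SB_CAP];
	size_t n;
};

struct outcome { int r1, r2, r3, r4; };

/* Append a store to the shared buffer and return its slot. */
static size_t post(struct machine *m, int var, int val, bool ready)
{
	assert(m->n < SB_CAP);
	m->sb[m->n] = (struct sb_entry){ var, val, ready, false };
	return m->n++;
}

/* Load for CPUs sharing the buffer: newest ready, undrained entry wins. */
static int load_forwarding(const struct machine *m, int var)
{
	for (size_t i = m->n; i-- > 0; )
		if (m->sb[i].ready && !m->sb[i].drained && m->sb[i].var == var)
			return m->sb[i].val;
	return m->cache[var];
}

/* Send one buffered store to the cache (not FIFO: any ready slot may go). */
static void drain(struct machine *m, size_t slot)
{
	assert(m->sb[slot].ready && !m->sb[slot].drained);
	m->cache[m->sb[slot].var] = m->sb[slot].val;
	m->sb[slot].drained = true;
}

struct outcome run_scenario(void)
{
	struct machine m = { 0 };
	struct outcome o;

	/* P0 posts both stores before reading *x; the release store is held. */
	size_t y_slot = post(&m, Y, 1, false);
	size_t z_slot = post(&m, Z, 1, true);

	/* P1 forwards *z=1 from the buffer and posts *x=r2. */
	o.r2 = load_forwarding(&m, Z);
	size_t x_slot = post(&m, X, o.r2, true);

	/* P0 reads *x, and only then marks its release store "ready". */
	o.r1 = load_forwarding(&m, X);
	m.sb[y_slot].ready = true;

	/* The release store reaches the cache first; P2 sees only the cache. */
	drain(&m, y_slot);
	o.r3 = m.cache[Y];
	o.r4 = m.cache[Z];

	/* Finally the remaining stores reach the cache. */
	drain(&m, z_slot);
	drain(&m, x_slot);
	return o;
}
```

In this toy model run_scenario() ends with r1=1, r3=1 and r4=0, which is
the outcome named in the exists clause. Note that the read of x does
complete before *y=1 leaves the buffer; the outcome arises only because
*z=1 was forwarded to P1 early and drained late.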
Alan