Re: Documentation for plain accesses and data races

From: Paul E. McKenney
Date: Thu Sep 12 2019 - 18:01:45 EST


On Fri, Sep 06, 2019 at 02:11:29PM -0400, Alan Stern wrote:
> Folks:
>
> I have spent some time writing up a section for
> tools/memory-model/Documentation/explanation.txt on plain accesses and
> data races. The initial version is below.
>
> I'm afraid it's rather long and perhaps gets too bogged down in
> complexities. On the other hand, this is a complicated topic so to
> some extent this is unavoidable.
>
> In any case, I'd like to hear your comments and reviews.

Good stuff, thank you for putting this together!

Please see below for some questions, comments, and confusion interspersed.

> Alan
>
> ------------------------------------------------------------------------
>
>
> PLAIN ACCESSES AND DATA RACES
> -----------------------------
>
> In the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y),
> smp_load_acquire(&z), and so on are collectively referred to as
> "marked" accesses, because they are all annotated with special
> operations of one kind or another. Ordinary C-language memory
> accesses such as x or y = 0 are simply called "plain" accesses.
>
> Early versions of the LKMM had nothing to say about plain accesses.
> The C standard allows compilers to assume that the variables affected
> by plain accesses are not concurrently read or written by any other
> threads or CPUs. This leaves compilers free to implement all manner
> of transformations or optimizations of code containing plain accesses,
> making such code very difficult for a memory model to handle.
>
> Here is just one example of a possible pitfall:
>
> int a = 6;
> int *x = &a;
>
> P0()
> {
> int *r1;
> int r2 = 0;
>
> r1 = x;
> if (r1 != NULL)
> r2 = READ_ONCE(*r1);
> }
>
> P1()
> {
> WRITE_ONCE(x, NULL);
> }

I tried making a litmus test out of this:

------------------------------------------------------------------------
C plain-1

{
int a = 6;
int *x = &a;
}

P0(int **x)
{
int *r1;
int r2 = 0;

r1 = *x;
if (r1 != 0)
r2 = READ_ONCE(*r1);
}

P1(int **x)
{
WRITE_ONCE(*x, 0);
}

locations [a; x; r1]
exists ~r2=6 /\ ~r2=0
------------------------------------------------------------------------

However, r1 steadfastly refuses to have any value other than zero.

------------------------------------------------------------------------
$ herd7 -conf linux-kernel.cfg /tmp/argh
Test plain-1 Allowed
States 1
a=6; r1=0; r2=0; x=0;
No
Witnesses
Positive: 0 Negative: 2
Flag data-race
Condition exists (not (r2=6) /\ not (r2=0))
Observation plain-1 Never 0 2
Time plain-1 0.00
Hash=b0fdbd0f627fd65e0cd413bf87f6f4a4
------------------------------------------------------------------------

What am I doing wrong here? Outdated herd7 version?

$ herd7 -version
7.52+7(dev), Rev: c81f1ff06f30d5c28c34d893a29f5f8505334179

Hmmm... I might well be in an inconsistent herd7/ocaml state. If no
one sees anything obvious, I will try reinstalling from scratch, but
that will not likely happen in the next few days.

> On the face of it, one would expect that when this code runs, the only
> possible final values for r2 are 6 and 0, depending on whether or not
> P1's store to x propagates to P0 before P0's load from x executes.
> But since P0's load from x is a plain access, the compiler may decide
> to carry out the load twice (for the comparison against NULL, then again
> for the READ_ONCE()) and eliminate the temporary variable r1. The
> object code generated for P0 could therefore end up looking rather
> like this:
>
> P0()
> {
> int r2 = 0;
>
> if (x != NULL)
> r2 = READ_ONCE(*x);
> }
>
> And now it is obvious that this code runs the risk of dereferencing a
> NULL pointer, because P1's store to x might propagate to P0 after the
> test against NULL has been made but before the READ_ONCE() executes.
> If the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x",
> the compiler would not have performed this optimization and there
> would be no possibility of a NULL-pointer dereference.
>
> Given the possibility of transformations like this one, the LKMM
> doesn't try to predict all possible outcomes of code containing plain
> accesses. It is content to determine whether the code violates the

I suggest starting this sentence with something like "It is instead
content to determine", adding "instead", to help the reader transition.

> compiler's assumptions, which would render the ultimate outcome
> undefined.
>
> In technical terms, the compiler is allowed to assume that when the
> program executes, there will not be any data races. A "data race"
> occurs when two conflicting memory accesses execute concurrently;
> two memory accesses "conflict" if:
>
> they access the same location,
>
> they occur on different CPUs (or in different threads on the
> same CPU),
>
> at least one of them is a plain access,
>
> and at least one of them is a store.
>
> The LKMM tries to determine whether a program contains two conflicting
> accesses which may execute concurrently; if it does then the LKMM says
> there is a potential data race and makes no predictions about the
> program's outcome.
>
> Determining whether two accesses conflict is easy; you can see that
> all the concepts involved in the definition above are already part of
> the memory model. The hard part is telling whether they may execute
> concurrently. The LKMM takes a conservative attitude, assuming that
> accesses may be concurrent unless it can prove they cannot.
>
> If two memory accesses aren't concurrent then one must execute before

Should this say "If two memory accesses to the same location aren't
concurrent..."?

> the other. Therefore the LKMM decides two accesses aren't concurrent
> if they can be connected by a sequence of hb, pb, and rb links
> (together referred to as xb, for "executes before"). However, there
> are two complicating factors.
>
> If X is a load and X executes before a store Y, then indeed there is
> no danger of X and Y being concurrent. After all, Y can't have any
> effect on the value obtained by X until the memory subsystem has
> propagated Y from its own CPU to X's CPU, which won't happen until
> some time after Y executes and thus after X executes. But if X is a
> store, then even if X executes before Y it is still possible that X
> will propagate to Y's CPU just as Y is executing. In such a case X
> could very well interfere somehow with Y, and we would have to
> consider X and Y to be concurrent.
>
> Therefore when X is a store, for X and Y to be non-concurrent the LKMM
> requires not only that X must execute before Y but also that X must
> propagate to Y's CPU before Y executes. (Or vice versa, of course, if
> Y executes before X -- then Y must propagate to X's CPU before X
> executes if Y is a store.) This is expressed by the visibility
> relation (vis), where X ->vis Y is defined to hold if there is an
> intermediate event Z such that:

"if there is a marked intermediate event Z such that"?

> X is connected to Z by a possibly empty sequence of
> cumul-fence links followed by an optional rfe link (if none of
> these links are present, X and Z are the same event),
>
> and either:
>
> Z is connected to Y by a strong-fence link followed by a
> possibly empty sequence of xb links,

"possibly empty sequence of xb links from a marked access"?

> or:
>
> Z is on the same CPU as Y and is connected to Y by a possibly
> empty sequence of xb links (again, if the sequence is empty it
> means Z and Y are the same event).
>
> The motivations behind this definition are straightforward:
>
> cumul-fence memory barriers force stores that are po-before
> the barrier to propagate to other CPUs before stores that are
> po-after the barrier.
>
> An rfe link from an event W to an event R says that R reads
> from W, which certainly means that W must have propagated to
> R's CPU before R executed.
>
> strong-fence memory barriers force stores that are po-before
> the barrier, or that propagate to the barrier's CPU before the
> barrier executes, to propagate to all CPUs before any events
> po-after the barrier can execute.
>
> To see how this works out in practice, consider our old friend, the MP
> pattern (with fences and statement labels, but without the conditional
> test):
>
> int buf = 0, flag = 0;
>
> P0()
> {
> X: WRITE_ONCE(buf, 1);
> smp_wmb();
> W: WRITE_ONCE(flag, 1);
> }
>
> P1()
> {
> int r1;
> int r2 = 0;
>
> Z: r1 = READ_ONCE(flag);
> smp_rmb();
> Y: r2 = READ_ONCE(buf);
> }

I have to ask. Why X then W then Z then Y? ;-)

(This is MP+fencewmbonceonce+fencermbonceonce.litmus in the current set
in tools/memory-model/litmus-tests.)

> The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
> assuming r1 = 1 at the end, there is an rfe link from W to Z. This
> means that the store to buf must propagate from P0 to P1 before Z
> executes. Next, Z and Y are on the same CPU and the smp_rmb() fence
> provides an xb link from Z to Y (i.e., it forces Z to execute before
> Y). Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
> executes.
>
> The second complicating factor mentioned above arises from the fact
> that when we are considering data races, some of the memory accesses
> are plain. Now, although we have not said so explicitly, up to this
> point most of the relations defined by the LKMM (ppo, hb, prop,
> cumul-fence, pb, and so on -- including vis) apply only to marked
> accesses.
>
> There are good reasons for this restriction. The compiler is not
> allowed to apply fancy transformations to marked accesses, and
> consequently each such access in the source code corresponds more or
> less directly to a single machine instruction in the object code. But
> plain accesses are a different story; the compiler may combine them,
> split them up, duplicate them, eliminate them, invent new ones, and
> who knows what else. Seeing a plain access in the source code tells
> you almost nothing about what machine instructions will end up in the
> object code.
>
> Fortunately, the compiler isn't completely free; it is subject to some
> limitations. For one, it is not allowed to introduce a data race into
> the object code if the source code does not already contain a data
> race (if it could, memory models would be useless and no multithreaded
> code would be safe!). For another, it cannot move a plain access past
> a compiler barrier.
>
> A compiler barrier is a kind of fence, but as the name implies, it
> only affects the compiler; it does not necessarily have any effect on
> how instructions are executed by the CPU. In Linux kernel source
> code, the barrier() function is a compiler barrier. It doesn't give
> rise directly to any machine instructions in the object code; rather,
> it affects how the compiler generates the rest of the object code.
> Given source code like this:
>
> ... some memory accesses ...
> barrier();
> ... some other memory accesses ...
>
> the barrier() function ensures that the machine instructions
> corresponding to the first group of accesses will all end po-before
> any machine instructions corresponding to the second group of accesses
> -- even if some of the accesses are plain. (Of course, the CPU may
> then execute some of those accesses out of program order, but we
> already know how to deal with such issues.) Without the barrier()
> there would be no such guarantee; the two groups of accesses could be
> intermingled or even reversed in the object code.
>
> The LKMM doesn't say much about the barrier() function, but it does
> require that all fences are also compiler barriers. In addition, it
> requires that the ordering properties of memory barriers such as
> smp_rmb() or smp_store_release() apply to plain accesses as well as to
> marked accesses.
>
> This is the key to analyzing data races. Consider the MP pattern
> again, now using plain accesses for buf:
>
> int buf = 0, flag = 0;
>
> P0()
> {
> U: buf = 1;
> smp_wmb();
> X: WRITE_ONCE(flag, 1);
> }
>
> P1()
> {
> int r1;
> int r2 = 0;
>
> Y: r1 = READ_ONCE(flag);
> if (r1) {
> smp_rmb();
> V: r2 = buf;
> }
> }

And same nit, why not just X, Y, and Z?

Similar issues with the litmus test:

------------------------------------------------------------------------
C plain-4

{
int buf = 0;
int flag = 0;
}


P0(int *buf, int *flag)
{
*buf = 1;
smp_wmb();
WRITE_ONCE(*flag, 1);
}

P1(int *buf, int *flag)
{
int r1;
int r2 = 0;

r1 = READ_ONCE(*flag);
if (r1) {
smp_rmb();
r2 = *buf;
}
}

exists r1=1 /\ r2=0
------------------------------------------------------------------------

> This program does not contain a data race. Although the U and V
> accesses conflict, the LKMM can prove they are not concurrent as
> follows:
>
> The smp_wmb() fence in P0 is both a compiler barrier and a
> cumul-fence. It guarantees that no matter what hash of
> machine instructions the compiler generates for the plain
> access U, all those instructions will be po-before the fence.
> Consequently U's store to buf, no matter how it is carried out
> at the machine level, must propagate to P1 before X's store to
> flag does.
>
> X and Y are both marked accesses. Hence an rfe link from X to
> Y is a valid indicator that X propagated to P1 before Y
> executed, i.e., X ->vis Y. (And if there is no rfe link then
> r1 will be 0, so V will not be executed and ipso facto won't
> race with U.)
>
> The smp_rmb() fence in P1 is a compiler barrier as well as a
> fence. It guarantees that all the machine-level instructions
> corresponding to the access V will be po-after the fence, and
> therefore any loads among those instructions will execute
> after the fence does and hence after Y does.
>
> Thus U's store to buf is forced to propagate to P1 before V's load
> executes (assuming V does execute), ruling out the possibility of a
> data race between them.
>
> This analysis illustrates how the LKMM deals with plain accesses in
> general. Suppose R is a plain load and we want to show that R
> executes before some marked access E. We can do this by finding a
> marked access X such that R and X are ordered by a suitable fence and
> X ->xb* E. If E was also a plain access, we would also look for a
> marked access Y such that X ->xb* Y, and Y and E are ordered by a
> fence. We describe this arrangement by saying that R is
> "post-bounded" by X and E is "pre-bounded" by Y.
>
> In fact, we go one step further: Since R is a read, we say that R is
> "r-post-bounded" by X. Similarly, E would be "r-pre-bounded" or
> "w-pre-bounded" by Y, depending on whether E was a store or a load.
> This distinction is needed because some fences affect only loads
> (i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise
> the two types of bounds are the same. And as a degenerate case, we
> say that a marked access pre-bounds and post-bounds itself (e.g., if R
> above were a marked load then X could simply be taken to be R itself.)
>
> The need to distinguish between r- and w-bounding raises yet another
> issue. When the source code contains a plain store, the compiler is
> allowed to put plain loads of the same location into the object code.
> For example, given the source code:
>
> x = 1;
>
> the compiler is theoretically allowed to generate object code that
> looks like:
>
> if (x != 1)
> x = 1;
>
> thereby adding a load (and possibly replacing the store entirely).
> For this reason, whenever the LKMM requires a plain store to be
> w-pre-bounded or w-post-bounded by a marked access, it also requires
> the store to be r-pre-bounded or r-post-bounded, so as to handle cases
> where the compiler adds a load.
>
> (This may be overly cautious. We don't know of any examples where a
> compiler has augmented a store with a load in this fashion, and the
> Linux kernel developers would probably fight pretty hard to change a
> compiler if it ever did this. Still, better safe than sorry.)
>
> Incidentally, the other tranformation -- augmenting a plain load by
> adding in a store to the same location -- is not allowed. This is
> because the compiler cannot know whether any other CPUs might perform
> a concurrent load from that location. Two concurrent loads don't
> constitute a race (they can't interfere with each other), but a store
> does race with a concurrent load. Thus adding a store might create a
> data race where one was not already present in the source code,
> something the compiler is forbidden to do. Augmenting a store with a
> load, on the other hand, is acceptable because doing so won't create a
> data race unless one already existed.
>
> The LKMM includes a second way to pre-bound plain accesses, in
> addition to fences: an address dependency from a marked load. That
> is, in the sequence:
>
> p = READ_ONCE(ptr);
> r = *p;
>
> the LKMM says that the marked load of ptr pre-bounds the plain load of
> *p; the marked load must execute before any of the machine
> instructions corresponding to the plain load. This is a reasonable
> stipulation, since after all, the CPU can't perform the load of *p
> until it knows what value p will hold. Furthermore, without some
> assumption like this one, some usages typical of RCU would count as
> data races. For example:
>
> int a = 1, b;

herd7 doesn't much like this.

> int *ptr = &a;
>
> P0()
> {
> b = 2;
> rcu_assign_ptr(ptr, &b);

rcu_assign_pointer(), globally.

> }
>
> P1()
> {
> int *p;
> int r;
>
> rcu_read_lock();
> p = rcu_dereference(ptr);
> r = *p;
> rcu_read_unlock();
> }

------------------------------------------------------------------------
C plain-5

{
int a = 1;
int b;
int *ptr = &a;
}

P0(int *b, int **ptr)
{
*b = 2;
rcu_assign_pointer(*ptr, b);
}

P1(int *ptr)
{
int *r1;
int r2;

rcu_read_lock();
r1 = rcu_dereference(*ptr);
r2 = *r1;
rcu_read_unlock();
}

exists r1=b /\ r2=1
------------------------------------------------------------------------

Also strange output:

------------------------------------------------------------------------
$ herd7 -conf linux-kernel.cfg /tmp/argh
Test plain-3 Allowed
States 1
r1=0; r2=0;
No
Witnesses
Positive: 0 Negative: 2
Condition exists (r1=b /\ r2=1)
Observation plain-3 Never 0 2
Time plain-3 0.00
Hash=44e039597a92bdc7efc9217813478126
------------------------------------------------------------------------

> (In this example the rcu_read_lock() and rcu_read_unlock() calls don't
> really do anything, because there aren't any grace periods. They are
> included merely for the sake of good form; typically P0 would call
> synchronize_rcu() somewhere after the rcu_assign_ptr().)
>
> rcu_assign_ptr() performs a store-release, so the plain store to b is
> definitely w-post-bounded before the store to ptr, and the two stores
> will propagate to P1 in that order. However, rcu_dereference() is
> only equivalent to READ_ONCE(). While it is a marked access, it is
> not a fence or compiler barrier. Hence the only guarantee we have
> that the load of ptr in P1 is r-pre-bounded before the load of *p
> (thus avoiding a race) is the assumption about address dependencies.
>
> This is a situation where the compiler can undermine the memory model,
> and a certain amount of care is required when programming constructs
> like this one. In particular, comparisons between the pointer and
> other known addresses can cause trouble. If you have something like:
>
> p = rcu_dereference(ptr);
> if (p == &x)
> r = *p;
>
> then the compiler just might generate object code resembling:
>
> p = rcu_dereference(ptr);
> if (p == &x)
> r = x;
>
> or even:
>
> rtemp = x;
> p = rcu_dereference(ptr);
> if (p == &x)
> r = rtemp;
>
> which would invalidate the memory model's assumption, since the CPU
> could now perform the load of x before the load of ptr (there might be
> a control dependency but no address dependency at the machine level).
>
> Finally, it turns out there is a situation in which a plain write does
> not need to be w-post-bounded: when it is separated from the
> conflicting access by a fence. At first glance this may seem
> impossible. After all, to be conflicting the second access has to be
> on a different CPU from the first, and fences don't link events on
> different CPUs. Well, normal fences don't -- but rcu-fence can!
> Here's an example:
>
> int x, y;
>
> P0()
> {
> WRITE_ONCE(x, 1);
> synchronize_rcu();
> y = 3;
> }
>
> P1()
> {
> rcu_read_lock();
> if (READ_ONCE(x) == 0)
> y = 2;
> rcu_read_unlock();
> }

I introduced an r1 to create an "exists" clause:

------------------------------------------------------------------------
C plain-6

{
int x;
int y;
}


P0(int *x, int *y)
{
WRITE_ONCE(*x, 1);
synchronize_rcu();
*y = 3;
}

P1(int *x, int *y)
{
int r1;

rcu_read_lock();
r1 = READ_ONCE(*x);
if (r1 == 0)
*y = 2;
rcu_read_unlock();
}

exists r1=0 /\ y=2
------------------------------------------------------------------------

> Do the plain stores to y race? Clearly not if P1 reads a non-zero
> value for x, so let's assume the READ_ONCE(x) does obtain 0. This
> means that the read-side critical section in P1 must finish executing
> before the grace period in P0 does, because RCU's Grace-Period
> Guarantee says that otherwise P0's store to x would have propagated to
> P1 before the critical section started and so would have been visible
> to the READ_ONCE(). (Another way of putting it is that the fre link
> from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link
> between those two events.)
>
> This means there is an rcu-fence link from P1's "y = 2" store to P0's
> "y = 3" store, and consequently the first must propagate from P1 to P0
> before the second can execute. Therefore the two stores cannot be
> concurrent and there is no race, even though P1's plain store to y
> isn't w-post-bounded by any marked accesses.
>
> Putting all this material together yields the following picture. For
> two conflicting stores W and W', where W ->co W', the LKMM says the
> stores don't race if W can be linked to W' by a
>
> w-post-bounded ; vis ; w-pre-bounded
>
> sequence. If W is plain then they also have to be linked by an
>
> r-post-bounded ; xb* ; w-pre-bounded
>
> sequence, and if W' is plain then they also have to be linked by a
>
> w-post-bounded ; vis ; r-pre-bounded
>
> sequence. For a conflicting load R and store W, the LKMM says the two
> accesses don't race if R can be linked to W by an
>
> r-post-bounded ; xb* ; w-pre-bounded
>
> sequence or if W can be linked to R by a
>
> w-post-bounded ; vis ; r-pre-bounded
>
> sequence. For the cases involving a vis link, the LKMM also accepts
> sequences in which W is linked to W' or R by a
>
> strong-fence ; xb* ; {w and/or r}-pre-bounded
>
> sequence with no post-bounding, and in every case the LKMM also allows
> the link simply to be a fence with no bounding at all. If no sequence
> of the appropriate sort exists, the LKMM says that the accesses race.
>
> There is one more part of the LKMM related to plain accesses (although
> not to data races) we should discuss. Recall that many relations such
> as hb are limited to marked accesses only. As a result, the
> happens-before, propagates-before, and rcu axioms (which state that
> various relation must not contain a cycle) doesn't apply to plain
> accesses. Nevertheless, we do want to rule out such cycles, because
> they don't make sense even for plain accesses.
>
> To this end, the LKMM imposes three extra restrictions, together
> called the "plain-coherence" axiom because of their resemblance to the
> coherency rules:
>
> If R and W conflict and it is possible to link R to W by one
> of the xb* sequences listed above, then W ->rfe R is not
> allowed (i.e., a load cannot read from a store that it
> executes before, even if one or both is plain).
>
> If W and R conflict and it is possible to link W to R by one
> of the vis sequences listed above, then R ->fre W is not
> allowed (i.e., if a store is visible to a load then the load
> must read from that store or one coherence-after it).
>
> If W and W' conflict and it is possible to link W to W' by one
> of the vis sequences listed above, then W' ->co W is not
> allowed (i.e., if one store is visible to another then it must
> come after in the coherence order).
>
> This is the extent to which the LKMM deals with plain accesses.
> Perhaps it could say more (for example, plain accesses might
> contribute to the ppo relation), but at the moment it seems that this
> minimal, conservative approach is good enough.

I will need to read this last section again. Perhaps more than once. ;-)

Anyway, good stuff!!!

Thanx, Paul