[GIT PULL] locking fixes
From: Ingo Molnar
Date: Thu Mar 24 2016 - 03:47:28 EST
Linus,
Please pull the latest locking-urgent-for-linus git tree from:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking-urgent-for-linus
# HEAD: f75d48644c56a31731d17fa693c8175328957e1d bitops: Do not default to __clear_bit() for __clear_bit_unlock()
Documentation updates and a bitops ordering fix.
Thanks,
Ingo
------------------>
Paul E. McKenney (7):
documentation: Fix control dependency and identical stores
documentation: Fix memory-barriers.txt section references
documentation: Remove obsolete reference to RCU-protected indexes
documentation: Subsequent writes ordered by rcu_dereference()
documentation: Distinguish between local and global transitivity
documentation: Add alternative release-acquire outcome
documentation: Transitivity is not cumulativity
Peter Zijlstra (1):
bitops: Do not default to __clear_bit() for __clear_bit_unlock()
SeongJae Park (1):
documentation: Clarify compiler store-fusion example
Documentation/memory-barriers.txt | 141 +++++++++++++++++++++++++++++++-------
include/asm-generic/bitops/lock.h | 14 ++--
2 files changed, 123 insertions(+), 32 deletions(-)
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 904ee42d078e..3729cbe60e41 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -232,7 +232,7 @@ GUARANTEES
with memory references that are not protected by READ_ONCE() and
WRITE_ONCE(). Without them, the compiler is within its rights to
do all sorts of "creative" transformations, which are covered in
- the Compiler Barrier section.
+ the COMPILER BARRIER section.
(*) It _must_not_ be assumed that independent loads and stores will be issued
in the order given. This means that for:
@@ -555,6 +555,30 @@ To deal with this, a data dependency barrier or better must be inserted
This enforces the occurrence of one of the two implications, and prevents the
third possibility from arising.
+A data-dependency barrier must also order against dependent writes:
+
+ CPU 1 CPU 2
+ =============== ===============
+ { A == 1, B == 2, C = 3, P == &A, Q == &C }
+ B = 4;
+ <write barrier>
+ WRITE_ONCE(P, &B);
+ Q = READ_ONCE(P);
+ <data dependency barrier>
+ *Q = 5;
+
+The data-dependency barrier must order the read into Q with the store
+into *Q. This prohibits this outcome:
+
+ (Q == B) && (B == 4)
+
+Please note that this pattern should be rare. After all, the whole point
+of dependency ordering is to -prevent- writes to the data structure, along
+with the expensive cache misses associated with those writes. This pattern
+can be used to record rare error conditions and the like, and the ordering
+prevents such records from being lost.
+
+
[!] Note that this extremely counterintuitive situation arises most easily on
machines with split caches, so that, for example, one cache bank processes
even-numbered cache lines and the other bank processes odd-numbered cache
@@ -565,21 +589,6 @@ odd-numbered bank is idle, one can see the new value of the pointer P (&B),
but the old value of the variable B (2).
-Another example of where data dependency barriers might be required is where a
-number is read from memory and then used to calculate the index for an array
-access:
-
- CPU 1 CPU 2
- =============== ===============
- { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
- M[1] = 4;
- <write barrier>
- WRITE_ONCE(P, 1);
- Q = READ_ONCE(P);
- <data dependency barrier>
- D = M[Q];
-
-
The data dependency barrier is very important to the RCU system,
for example. See rcu_assign_pointer() and rcu_dereference() in
include/linux/rcupdate.h. This permits the current target of an RCU'd
@@ -800,9 +809,13 @@ site: https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html.
use smp_rmb(), smp_wmb(), or, in the case of prior stores and
later loads, smp_mb().
- (*) If both legs of the "if" statement begin with identical stores
- to the same variable, a barrier() statement is required at the
- beginning of each leg of the "if" statement.
+ (*) If both legs of the "if" statement begin with identical stores to
+ the same variable, then those stores must be ordered, either by
+ preceding both of them with smp_mb() or by using smp_store_release()
+ to carry out the stores. Please note that it is -not- sufficient
+ to use barrier() at beginning of each leg of the "if" statement,
+ as optimizing compilers do not necessarily respect barrier()
+ in this case.
(*) Control dependencies require at least one run-time conditional
between the prior load and the subsequent store, and this
@@ -814,7 +827,7 @@ site: https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html.
(*) Control dependencies require that the compiler avoid reordering the
dependency into nonexistence. Careful use of READ_ONCE() or
atomic{,64}_read() can help to preserve your control dependency.
- Please see the Compiler Barrier section for more information.
+ Please see the COMPILER BARRIER section for more information.
(*) Control dependencies pair normally with other types of barriers.
@@ -1257,7 +1270,7 @@ TRANSITIVITY
Transitivity is a deeply intuitive notion about ordering that is not
always provided by real computer systems. The following example
-demonstrates transitivity (also called "cumulativity"):
+demonstrates transitivity:
CPU 1 CPU 2 CPU 3
======================= ======================= =======================
@@ -1305,8 +1318,86 @@ or a level of cache, CPU 2 might have early access to CPU 1's writes.
General barriers are therefore required to ensure that all CPUs agree
on the combined order of CPU 1's and CPU 2's accesses.
-To reiterate, if your code requires transitivity, use general barriers
-throughout.
+General barriers provide "global transitivity", so that all CPUs will
+agree on the order of operations. In contrast, a chain of release-acquire
+pairs provides only "local transitivity", so that only those CPUs on
+the chain are guaranteed to agree on the combined order of the accesses.
+For example, switching to C code in deference to Herman Hollerith:
+
+ int u, v, x, y, z;
+
+ void cpu0(void)
+ {
+ r0 = smp_load_acquire(&x);
+ WRITE_ONCE(u, 1);
+ smp_store_release(&y, 1);
+ }
+
+ void cpu1(void)
+ {
+ r1 = smp_load_acquire(&y);
+ r4 = READ_ONCE(v);
+ r5 = READ_ONCE(u);
+ smp_store_release(&z, 1);
+ }
+
+ void cpu2(void)
+ {
+ r2 = smp_load_acquire(&z);
+ smp_store_release(&x, 1);
+ }
+
+ void cpu3(void)
+ {
+ WRITE_ONCE(v, 1);
+ smp_mb();
+ r3 = READ_ONCE(u);
+ }
+
+Because cpu0(), cpu1(), and cpu2() participate in a local transitive
+chain of smp_store_release()/smp_load_acquire() pairs, the following
+outcome is prohibited:
+
+ r0 == 1 && r1 == 1 && r2 == 1
+
+Furthermore, because of the release-acquire relationship between cpu0()
+and cpu1(), cpu1() must see cpu0()'s writes, so that the following
+outcome is prohibited:
+
+ r1 == 1 && r5 == 0
+
+However, the transitivity of release-acquire is local to the participating
+CPUs and does not apply to cpu3(). Therefore, the following outcome
+is possible:
+
+ r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0
+
+As an aside, the following outcome is also possible:
+
+ r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 && r5 == 1
+
+Although cpu0(), cpu1(), and cpu2() will see their respective reads and
+writes in order, CPUs not involved in the release-acquire chain might
+well disagree on the order. This disagreement stems from the fact that
+the weak memory-barrier instructions used to implement smp_load_acquire()
+and smp_store_release() are not required to order prior stores against
+subsequent loads in all cases. This means that cpu3() can see cpu0()'s
+store to u as happening -after- cpu1()'s load from v, even though
+both cpu0() and cpu1() agree that these two operations occurred in the
+intended order.
+
+However, please keep in mind that smp_load_acquire() is not magic.
+In particular, it simply reads from its argument with ordering. It does
+-not- ensure that any particular value will be read. Therefore, the
+following outcome is possible:
+
+ r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0
+
+Note that this outcome can happen even on a mythical sequentially
+consistent system where nothing is ever reordered.
+
+To reiterate, if your code requires global transitivity, use general
+barriers throughout.
========================
@@ -1459,7 +1550,7 @@ be fatal in concurrent code. Here are some examples of these sorts
the following:
a = 0;
- /* Code that does not store to variable a. */
+ ... Code that does not store to variable a ...
a = 0;
The compiler sees that the value of variable 'a' is already zero, so
@@ -1471,7 +1562,7 @@ be fatal in concurrent code. Here are some examples of these sorts
wrong guess:
WRITE_ONCE(a, 0);
- /* Code that does not store to variable a. */
+ ... Code that does not store to variable a ...
WRITE_ONCE(a, 0);
(*) The compiler is within its rights to reorder memory accesses unless
diff --git a/include/asm-generic/bitops/lock.h b/include/asm-generic/bitops/lock.h
index c30266e94806..8ef0ccbf8167 100644
--- a/include/asm-generic/bitops/lock.h
+++ b/include/asm-generic/bitops/lock.h
@@ -29,16 +29,16 @@ do { \
* @nr: the bit to set
* @addr: the address to start counting from
*
- * This operation is like clear_bit_unlock, however it is not atomic.
- * It does provide release barrier semantics so it can be used to unlock
- * a bit lock, however it would only be used if no other CPU can modify
- * any bits in the memory until the lock is released (a good example is
- * if the bit lock itself protects access to the other bits in the word).
+ * A weaker form of clear_bit_unlock() as used by __bit_lock_unlock(). If all
+ * the bits in the word are protected by this lock some archs can use weaker
+ * ops to safely unlock.
+ *
+ * See for example x86's implementation.
*/
#define __clear_bit_unlock(nr, addr) \
do { \
- smp_mb(); \
- __clear_bit(nr, addr); \
+ smp_mb__before_atomic(); \
+ clear_bit(nr, addr); \
} while (0)
#endif /* _ASM_GENERIC_BITOPS_LOCK_H_ */