Re: Memory-ordering recipes

From: Paul E. McKenney
Date: Tue Oct 17 2017 - 17:01:49 EST


On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:
> Hello!
>
> The topic of memory-ordering recipes came up at the Linux Plumbers
> Conference microconference on Friday, so I thought that I should summarize
> what is currently "out there":

And here is an updated list of potential Linux-kernel examples for a
"recipes" document; thank you for the feedback. Please feel free
to counterpropose better examples. In addition, if there is some other
pattern that is commonplace and important enough to be included in a
recipes document, please point it out.

Thanx, Paul

------------------------------------------------------------------------

This document lists the litmus-test patterns that we have been discussing,
along with examples from the Linux kernel. This is intended to feed into
the recipes document. All examples are from v4.13.

0. Simple special cases

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order. There are (as
usual) some things to be careful of:

a. There are some aspects of the C language that are
unordered. For example, in the expression "f(x) + g(y)",
the order in which f and g are called is not defined;
the object code is allowed to use either order or even
to interleave the computations.

b. Compilers are permitted to use the "as-if" rule. That is,
a compiler can emit whatever code it likes, as long as
the results of a single-threaded execution appear just
as if the compiler had followed all the relevant rules.
To see this, compile with a high level of optimization
and run the debugger on the resulting binary.

c. If there is only one variable but multiple CPUs, that
variable must be properly aligned and all accesses to it
must be full sized. Variables that straddle cachelines or
pages void your full-ordering warranty, as do undersized
accesses that load from or store to only part of the
variable.
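
For example, here is a minimal sketch of the multiple-CPU,
single-variable case. The variable and function names are
hypothetical; the point is that READ_ONCE() and WRITE_ONCE() on a
properly aligned variable give full-sized, non-torn accesses:

static int shared_flag;         /* hypothetical; naturally aligned, full sized */

void set_flag(void)             /* hypothetical writer */
{
        WRITE_ONCE(shared_flag, 1);     /* full-sized store */
}

int test_flag(void)             /* hypothetical reader */
{
        return READ_ONCE(shared_flag);  /* full-sized load */
}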

1. Another simple case: Locking. [ Assuming you don't think too
hard about it, that is! ]

Any CPU that has acquired a given lock will see any changes made
by any CPU prior to that CPU's release of that same lock.

[ Should we discuss chaining back through different locks,
sort of like release-acquire chains? ]
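
As a minimal sketch (hypothetical lock, variable, and functions, not
taken from the kernel source):

spinlock_t mylock;      /* hypothetical */
int x;                  /* protected by mylock */

void cpu0(void)
{
        spin_lock(&mylock);
        x = 1;                  /* store while holding the lock */
        spin_unlock(&mylock);
}

void cpu1(void)
{
        int r1;

        spin_lock(&mylock);
        r1 = x;                 /* sees 1 if this acquisition follows cpu0's release */
        spin_unlock(&mylock);
}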

2. MP (see test6.pdf for nickname translation)

a. smp_store_release() / smp_load_acquire()

init_stack_slab() in lib/stackdepot.c uses release-acquire
to handle initialization of a slab of the stack. Working
out the mutual-exclusion design is left as an exercise for
the reader.
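
Abstracting away the slab details, the underlying pattern is
presumably the usual publish/subscribe sketch below (all names,
including do_something_with(), are hypothetical):

int data;       /* hypothetical payload */
int flag;       /* hypothetical flag guarding the payload */

void producer(void)
{
        data = 42;                      /* initialize the payload... */
        smp_store_release(&flag, 1);    /* ...then publish it */
}

void consumer(void)
{
        if (smp_load_acquire(&flag))    /* if published... */
                do_something_with(data);        /* ...the payload is visible */
}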

b. rcu_assign_pointer() / rcu_dereference()

expand_to_next_prime() does the rcu_assign_pointer(),
and next_prime_number() does the rcu_dereference().
This mediates access to a bit vector that is expanded
as additional primes are needed. These two functions
are in lib/prime_numbers.c.
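
The same MP pattern with a pointer looks roughly like the following
sketch (hypothetical names; do_something_with() is a placeholder):

struct foo {                    /* hypothetical structure */
        int a;
};
struct foo __rcu *gp;           /* hypothetical RCU-protected pointer */

void publish(void)
{
        struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

        if (!p)
                return;
        p->a = 1;                       /* initialize before publication */
        rcu_assign_pointer(gp, p);      /* publish with release semantics */
}

void read_side(void)
{
        struct foo *p;

        rcu_read_lock();
        p = rcu_dereference(gp);        /* dependency-ordered load */
        if (p)
                do_something_with(p->a);        /* guaranteed to see p->a == 1 */
        rcu_read_unlock();
}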

c. smp_wmb() / smp_rmb()

xlog_state_switch_iclogs() contains the following:

log->l_curr_block -= log->l_logBBsize;
ASSERT(log->l_curr_block >= 0);
smp_wmb();
log->l_curr_cycle++;

And xlog_valid_lsn() contains the following:

cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
smp_rmb();
cur_block = ACCESS_ONCE(log->l_curr_block);

Alternatively, from the comment in perf_output_put_handle()
in kernel/events/ring_buffer.c:

*   kernel                              user
*
*   if (LOAD ->data_tail) {             LOAD ->data_head
*                      (A)              smp_rmb()       (C)
*      STORE $data                      LOAD $data
*      smp_wmb()       (B)              smp_mb()        (D)
*      STORE ->data_head                STORE ->data_tail
*   }
*
* Where A pairs with D, and B pairs with C.

The B/C pairing is MP with smp_wmb() and smp_rmb().

d. Replacing either of the above with smp_mb()

Holding off on this one for the moment...

3. LB

a. LB+ctrl+mb

Again, from the comment in perf_output_put_handle()
in kernel/events/ring_buffer.c:

*   kernel                              user
*
*   if (LOAD ->data_tail) {             LOAD ->data_head
*                      (A)              smp_rmb()       (C)
*      STORE $data                      LOAD $data
*      smp_wmb()       (B)              smp_mb()        (D)
*      STORE ->data_head                STORE ->data_tail
*   }
*
* Where A pairs with D, and B pairs with C.

The A/D pairing covers this one.
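
Rendered as C, the A/D leg has roughly the following LB shape, with
the kernel side relying on the control dependency from (A) and the
user side on the smp_mb() at (D). The variable and function names
below are hypothetical stand-ins for the ring-buffer fields:

int data_head, data_tail;       /* hypothetical stand-ins */

void kernel_side(void)
{
        if (READ_ONCE(data_tail)) {             /* (A) load heads a control dependency */
                /* ... STORE $data ... */
                smp_wmb();                      /* (B) */
                WRITE_ONCE(data_head, 1);       /* ordered after (A) by the control dep */
        }
}

void user_side(void)
{
        int head = READ_ONCE(data_head);        /* (C) load */
        /* ... LOAD $data ... */
        smp_mb();                               /* (D) */
        WRITE_ONCE(data_tail, head);            /* ordered after (C) by smp_mb() */
}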

4. Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB

Lots of variety here; one can in some cases substitute:

a. READ_ONCE() for smp_load_acquire()
b. WRITE_ONCE() for smp_store_release()
c. Dependencies for both smp_load_acquire() and
smp_store_release().
d. smp_wmb() for smp_store_release() in first thread
of ISA2 and Z6.2.
e. smp_rmb() for smp_load_acquire() in last thread of ISA2.
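
For reference, the baseline three-CPU release-acquire chain (ISA2)
into which the above substitutions can be made might be sketched as
follows (hypothetical variables and functions):

int x, y, z;    /* hypothetical */

void cpu0(void)
{
        WRITE_ONCE(x, 1);
        smp_store_release(&y, 1);
}

void cpu1(void)
{
        int r1;

        r1 = smp_load_acquire(&y);
        smp_store_release(&z, 1);
}

void cpu2(void)
{
        int r2, r3;

        r2 = smp_load_acquire(&z);
        r3 = READ_ONCE(x);
        /* If r1 == 1 and r2 == 1, then r3 is guaranteed to be 1. */
}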

The canonical illustration of LB involves the various memory
allocators, where you don't want a load from about-to-be-freed
memory to see a store initializing a later incarnation of that
same memory area. But the per-CPU caches make this a very
long and complicated example.

I am not aware of any three-CPU release-acquire chains in the
Linux kernel. There are three-CPU lock-based chains in RCU,
but these are not at all simple, either.

Thoughts?

5. SB

a. smp_mb(), as in lockless wait-wakeup coordination.
And as in sys_membarrier()-scheduler coordination,
for that matter.
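
In the abstract, SB is just the following, the interesting question
being what supplies the two full barriers in real code (hypothetical
variables and functions):

int x, y;       /* hypothetical */

void cpu0(void)
{
        int r1;

        WRITE_ONCE(x, 1);
        smp_mb();
        r1 = READ_ONCE(y);
}

void cpu1(void)
{
        int r2;

        WRITE_ONCE(y, 1);
        smp_mb();
        r2 = READ_ONCE(x);
}

/* The outcome r1 == 0 && r2 == 0 is forbidden. */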

Examples seem to be lacking. Most cases use locking.
Here is one rather strange one from RCU:

void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
{
        unsigned long flags;
        bool needwake;
        bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);

        rhp->next = NULL;
        rhp->func = func;
        raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
        needwake = !rcu_tasks_cbs_head;
        *rcu_tasks_cbs_tail = rhp;
        rcu_tasks_cbs_tail = &rhp->next;
        raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
        /* We can't create the thread unless interrupts are enabled. */
        if ((needwake && havetask) ||
            (!havetask && !irqs_disabled_flags(flags))) {
                rcu_spawn_tasks_kthread();
                wake_up(&rcu_tasks_cbs_wq);
        }
}

And for the wait side, using synchronize_sched() to supply
the barrier for both ends, with the preemption disabling
due to raw_spin_lock_irqsave() serving as the read-side
critical section:

if (!list) {
        wait_event_interruptible(rcu_tasks_cbs_wq,
                                 rcu_tasks_cbs_head);
        if (!rcu_tasks_cbs_head) {
                WARN_ON(signal_pending(current));
                schedule_timeout_interruptible(HZ/10);
        }
        continue;
}
synchronize_sched();

-----------------

Here is another one that uses atomic_cmpxchg() as a
full memory barrier:

if (!wait_event_timeout(*wait, !atomic_read(stopping),
                        msecs_to_jiffies(1000))) {
        atomic_set(stopping, 0);
        smp_mb();
        return -ETIMEDOUT;
}

int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
                                     atomic_t *stopping)
{
        if (atomic_cmpxchg(stopping, 1, 0)) {
                wake_up(wait);
                return 1;
        }

        return 0;
}

-----------------

And here is the generic pattern for the above two examples
taken from waitqueue_active() in include/linux/wait.h:

*      CPU0 - waker                      CPU1 - waiter
*
*                                        for (;;) {
*      @cond = true;                       prepare_to_wait(&wq_head, &wait, state);
*      smp_mb();                           // smp_mb() from set_current_state()
*      if (waitqueue_active(wq_head))      if (@cond)
*        wake_up(wq_head);                   break;
*                                           schedule();
*                                        }
*                                        finish_wait(&wq_head, &wait);

Note that prepare_to_wait() does both the write and the
set_current_state() that contains the smp_mb(). The reads are
the waitqueue_active() on the one hand and the "if (@cond)"
check on the other.

6. W+RWC+porel+mb+mb

See recipes-LKcode-63cae12bce986.txt.

Mostly of historical interest -- as far as I know, this commit
was the first to contain a litmus test.

7. Context switch and migration. A bit specialized, so might leave
this one out.

When a thread moves from one CPU to another, the
scheduler is required to do whatever is necessary for the thread
to see any prior accesses that it executed on other CPUs. This
includes "interesting" interactions with wake_up() shown in the
following comment from try_to_wake_up() in kernel/sched/core.c:

* Notes on Program-Order guarantees on SMP systems.
*
* MIGRATION
*
* The basic program-order guarantee on SMP systems is that when a task [t]
* migrates, all its activity on its old CPU [c0] happens-before any subsequent
* execution on its new CPU [c1].
*
* For migration (of runnable tasks) this is provided by the following means:
*
* A) UNLOCK of the rq(c0)->lock scheduling out task t
* B) migration for t is required to synchronize *both* rq(c0)->lock and
* rq(c1)->lock (if not at the same time, then in that order).
* C) LOCK of the rq(c1)->lock scheduling in task
*
* Transitivity guarantees that B happens after A and C after B.
* Note: we only require RCpc transitivity.
* Note: the CPU doing B need not be c0 or c1
*
* Example:
*
*   CPU0            CPU1            CPU2
*
*   LOCK rq(0)->lock
*   sched-out X
*   sched-in Y
*   UNLOCK rq(0)->lock
*
*                                   LOCK rq(0)->lock // orders against CPU0
*                                   dequeue X
*                                   UNLOCK rq(0)->lock
*
*                                   LOCK rq(1)->lock
*                                   enqueue X
*                                   UNLOCK rq(1)->lock
*
*                   LOCK rq(1)->lock // orders against CPU2
*                   sched-out Z
*                   sched-in X
*                   UNLOCK rq(1)->lock
*
*
* BLOCKING -- aka. SLEEP + WAKEUP
*
* For blocking we (obviously) need to provide the same guarantee as for
* migration. However the means are completely different as there is no lock
* chain to provide order. Instead we do:
*
* 1) smp_store_release(X->on_cpu, 0)
* 2) smp_cond_load_acquire(!X->on_cpu)
*
* Example:
*
*   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
*
*   LOCK rq(0)->lock LOCK X->pi_lock
*   dequeue X
*   sched-out X
*   smp_store_release(X->on_cpu, 0);
*
*                    smp_cond_load_acquire(&X->on_cpu, !VAL);
*                    X->state = WAKING
*                    set_task_cpu(X,2)
*
*                    LOCK rq(2)->lock
*                    enqueue X
*                    X->state = RUNNING
*                    UNLOCK rq(2)->lock
*
*                                          LOCK rq(2)->lock // orders against CPU1
*                                          sched-out Z
*                                          sched-in X
*                                          UNLOCK rq(2)->lock
*
*                    UNLOCK X->pi_lock
*   UNLOCK rq(0)->lock
*
*
* However; for wakeups there is a second guarantee we must provide, namely we
* must observe the state that lead to our wakeup. That is, not only must our
* task observe its own prior state, it must also observe the stores prior to
* its wakeup.
*
* This means that any means of doing remote wakeups must order the CPU doing
* the wakeup against the CPU the task is going to end up running on. This,
* however, is already required for the regular Program-Order guarantee above,
* since the waking CPU is the one issueing the ACQUIRE (smp_cond_load_acquire).
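
Abstracting the blocking case down to its two key primitives gives
roughly the following sketch; the structure and function names are
hypothetical stand-ins for task_struct and the scheduler paths:

struct task {                   /* hypothetical stand-in for task_struct */
        int on_cpu;
        /* ... */
};

/* Old CPU: last step of scheduling X out. */
void finish_sched_out(struct task *X)
{
        smp_store_release(&X->on_cpu, 0);       /* releases X's prior accesses */
}

/* Waking CPU: wait until X is completely off its old CPU. */
void wakeup_wait_for_sched_out(struct task *X)
{
        smp_cond_load_acquire(&X->on_cpu, !VAL); /* acquire; orders X's later placement */
}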

------------------------------------------------------------------------

[ The remainder of this message reproduces recipes-LKcode-63cae12bce986.txt,
referenced in item 6 above. ]

commit 63cae12bce9861cec309798d34701cf3da20bc71
Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Date: Fri Dec 9 14:59:00 2016 +0100

perf/core: Fix sys_perf_event_open() vs. hotplug

There is problem with installing an event in a task that is 'stuck' on
an offline CPU.

Blocked tasks are not dis-assosciated from offlined CPUs, after all, a
blocked task doesn't run and doesn't require a CPU etc.. Only on
wakeup do we ammend the situation and place the task on a available
CPU.

If we hit such a task with perf_install_in_context() we'll loop until
either that task wakes up or the CPU comes back online, if the task
waking depends on the event being installed, we're stuck.

While looking into this issue, I also spotted another problem, if we
hit a task with perf_install_in_context() that is in the middle of
being migrated, that is we observe the old CPU before sending the IPI,
but run the IPI (on the old CPU) while the task is already running on
the new CPU, things also go sideways.

Rework things to rely on task_curr() -- outside of rq->lock -- which
is rather tricky. Imagine the following scenario where we're trying to
install the first event into our task 't':

  CPU0            CPU1            CPU2

                  (current == t)

  t->perf_event_ctxp[] = ctx;
  smp_mb();
  cpu = task_cpu(t);

                  switch(t, n);
                  migrate(t, 2);
                  switch(p, t);

                                  ctx = t->perf_event_ctxp[]; // must not be NULL

  smp_function_call(cpu, ..);

                  generic_exec_single()
                    func();
                      spin_lock(ctx->lock);
                      if (task_curr(t)) // false

                      add_event_to_ctx();
                      spin_unlock(ctx->lock);

                                  perf_event_context_sched_in();
                                    spin_lock(ctx->lock);
                                    // sees event

So its CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
Because if CPU2's load of that variable were to observe NULL, it would
not try to schedule the ctx and we'd have a task running without its
counter, which would be 'bad'.

As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
first and not see the event yet, then CPU0 must observe task_curr()
and retry. If the install happens first, then we must see the event on
sched-in and all is well.

I think we can translate the first part (until the 'must not be NULL')
of the scenario to a litmus test like:

C C-peterz

{
}

P0(int *x, int *y)
{
        int r1;

        WRITE_ONCE(*x, 1);
        smp_mb();
        r1 = READ_ONCE(*y);
}

P1(int *y, int *z)
{
        WRITE_ONCE(*y, 1);
        smp_store_release(z, 1);
}

P2(int *x, int *z)
{
        int r1;
        int r2;

        r1 = smp_load_acquire(z);
        smp_mb();
        r2 = READ_ONCE(*x);
}

exists
(0:r1=0 /\ 2:r1=1 /\ 2:r2=0)

Where:
x is perf_event_ctxp[],
y is our tasks's CPU, and
z is our task being placed on the rq of CPU2.

The P0 smp_mb() is the one added by this patch, ordering the store to
perf_event_ctxp[] from find_get_context() and the load of task_cpu()
in task_function_call().

The smp_store_release/smp_load_acquire model the RCpc locking of the
rq->lock and the smp_mb() of P2 is the context switch switching from
whatever CPU2 was running to our task 't'.

This litmus test evaluates into:

Test C-peterz Allowed
States 7
0:r1=0; 2:r1=0; 2:r2=0;
0:r1=0; 2:r1=0; 2:r2=1;
0:r1=0; 2:r1=1; 2:r2=1;
0:r1=1; 2:r1=0; 2:r2=0;
0:r1=1; 2:r1=0; 2:r2=1;
0:r1=1; 2:r1=1; 2:r2=0;
0:r1=1; 2:r1=1; 2:r2=1;
No
Witnesses
Positive: 0 Negative: 7
Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
Observation C-peterz Never 0 7
Hash=e427f41d9146b2a5445101d3e2fcaa34

And the strong and weak model agree.

Reported-by: Mark Rutland <mark.rutland@xxxxxxx>
Tested-by: Mark Rutland <mark.rutland@xxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Cc: Alexander Shishkin <alexander.shishkin@xxxxxxxxxxxxxxx>
Cc: Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>
Cc: Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>
Cc: Jiri Olsa <jolsa@xxxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
Cc: Stephane Eranian <eranian@xxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Vince Weaver <vincent.weaver@xxxxxxxxx>
Cc: Will Deacon <will.deacon@xxxxxxx>
Cc: jeremy.linton@xxxxxxx
Link: http://lkml.kernel.org/r/20161209135900.GU3174@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>

diff --git a/kernel/events/core.c b/kernel/events/core.c
index ab15509fab8c..72ce7d63e561 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2249,7 +2249,7 @@ static int __perf_install_in_context(void *info)
struct perf_event_context *ctx = event->ctx;
struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
struct perf_event_context *task_ctx = cpuctx->task_ctx;
- bool activate = true;
+ bool reprogram = true;
int ret = 0;

raw_spin_lock(&cpuctx->ctx.lock);
@@ -2257,27 +2257,26 @@ static int __perf_install_in_context(void *info)
raw_spin_lock(&ctx->lock);
task_ctx = ctx;

- /* If we're on the wrong CPU, try again */
- if (task_cpu(ctx->task) != smp_processor_id()) {
- ret = -ESRCH;
- goto unlock;
- }
+ reprogram = (ctx->task == current);

/*
- * If we're on the right CPU, see if the task we target is
- * current, if not we don't have to activate the ctx, a future
- * context switch will do that for us.
+ * If the task is running, it must be running on this CPU,
+ * otherwise we cannot reprogram things.
+ *
+ * If its not running, we don't care, ctx->lock will
+ * serialize against it becoming runnable.
*/
- if (ctx->task != current)
- activate = false;
- else
- WARN_ON_ONCE(cpuctx->task_ctx && cpuctx->task_ctx != ctx);
+ if (task_curr(ctx->task) && !reprogram) {
+ ret = -ESRCH;
+ goto unlock;
+ }

+ WARN_ON_ONCE(reprogram && cpuctx->task_ctx && cpuctx->task_ctx != ctx);
} else if (task_ctx) {
raw_spin_lock(&task_ctx->lock);
}

- if (activate) {
+ if (reprogram) {
ctx_sched_out(ctx, cpuctx, EVENT_TIME);
add_event_to_ctx(event, ctx);
ctx_resched(cpuctx, task_ctx);
@@ -2328,13 +2327,36 @@ perf_install_in_context(struct perf_event_context *ctx,
/*
* Installing events is tricky because we cannot rely on ctx->is_active
* to be set in case this is the nr_events 0 -> 1 transition.
+ *
+ * Instead we use task_curr(), which tells us if the task is running.
+ * However, since we use task_curr() outside of rq::lock, we can race
+ * against the actual state. This means the result can be wrong.
+ *
+ * If we get a false positive, we retry, this is harmless.
+ *
+ * If we get a false negative, things are complicated. If we are after
+ * perf_event_context_sched_in() ctx::lock will serialize us, and the
+ * value must be correct. If we're before, it doesn't matter since
+ * perf_event_context_sched_in() will program the counter.
+ *
+ * However, this hinges on the remote context switch having observed
+ * our task->perf_event_ctxp[] store, such that it will in fact take
+ * ctx::lock in perf_event_context_sched_in().
+ *
+ * We do this by task_function_call(), if the IPI fails to hit the task
+ * we know any future context switch of task must see the
+ * perf_event_ctpx[] store.
*/
-again:
+
/*
- * Cannot use task_function_call() because we need to run on the task's
- * CPU regardless of whether its current or not.
+ * This smp_mb() orders the task->perf_event_ctxp[] store with the
+ * task_cpu() load, such that if the IPI then does not find the task
+ * running, a future context switch of that task must observe the
+ * store.
*/
- if (!cpu_function_call(task_cpu(task), __perf_install_in_context, event))
+ smp_mb();
+again:
+ if (!task_function_call(task, __perf_install_in_context, event))
return;

raw_spin_lock_irq(&ctx->lock);
@@ -2348,12 +2370,16 @@ again:
raw_spin_unlock_irq(&ctx->lock);
return;
}
- raw_spin_unlock_irq(&ctx->lock);
/*
- * Since !ctx->is_active doesn't mean anything, we must IPI
- * unconditionally.
+ * If the task is not running, ctx->lock will avoid it becoming so,
+ * thus we can safely install the event.
*/
- goto again;
+ if (task_curr(task)) {
+ raw_spin_unlock_irq(&ctx->lock);
+ goto again;
+ }
+ add_event_to_ctx(event, ctx);
+ raw_spin_unlock_irq(&ctx->lock);
}

/*