Re: Perf hotplug lockup in v4.9-rc8
From: Peter Zijlstra
Date: Mon Dec 12 2016 - 07:42:37 EST
On Mon, Dec 12, 2016 at 11:46:40AM +0000, Will Deacon wrote:
> > @@ -2331,13 +2330,36 @@ perf_install_in_context(struct perf_event_context *ctx,
> > /*
> > * Installing events is tricky because we cannot rely on ctx->is_active
> > * to be set in case this is the nr_events 0 -> 1 transition.
> > + *
> > + * Instead we use task_curr(), which tells us if the task is running.
> > + * However, since we use task_curr() outside of rq::lock, we can race
> > + * against the actual state. This means the result can be wrong.
> > + *
> > + * If we get a false positive, we retry, this is harmless.
> > + *
> > + * If we get a false negative, things are complicated. If we are after
> > + * perf_event_context_sched_in() ctx::lock will serialize us, and the
> > + * value must be correct. If we're before, it doesn't matter since
> > + * perf_event_context_sched_in() will program the counter.
> > + *
> > + * However, this hinges on the remote context switch having observed
> > + * our task->perf_event_ctxp[] store, such that it will in fact take
> > + * ctx::lock in perf_event_context_sched_in().
> > + *
> > + * We do this by task_function_call(), if the IPI fails to hit the task
> > + * we know any future context switch of task must see the
> > + * perf_event_ctxp[] store.
> > */
> > +
> > /*
> > + * This smp_mb() orders the task->perf_event_ctxp[] store with the
> > + * task_cpu() load, such that if the IPI then does not find the task
> > + * running, a future context switch of that task must observe the
> > + * store.
> > */
> > + smp_mb();
> > +again:
> > + if (!task_function_call(task, __perf_install_in_context, event))
> > return;
>
> I'm trying to figure out whether or not the barriers implied by the IPI
> are sufficient here, or whether we really need the explicit smp_mb().
> Certainly, arch_send_call_function_single_ipi has to order the publishing
> of the remote work before the signalling of the interrupt, but the comment
> above refers to "the task_cpu() load" and I can't see that after your
> diff.
>
> What are you trying to order here?
I suppose something like this:
CPU0                          CPU1                          CPU2

                              (current == t)

t->perf_event_ctxp[] = ctx;
smp_mb();
cpu = task_cpu(t);

                              switch(t, n);
                                                            migrate(t, 2);
                                                            switch(p, t);

                                                            ctx = t->perf_event_ctxp[]; // must not be NULL

smp_function_call(cpu, ..);

                              generic_exec_single()
                                func();
                                  spin_lock(ctx->lock);
                                  if (task_curr(t)) // false
                                  add_event_to_ctx();
                                  spin_unlock(ctx->lock);

                                                            perf_event_context_sched_in();
                                                              spin_lock(ctx->lock);
                                                              // sees event
Where, between setting perf_event_ctxp[] and sending the IPI, the task
moves away and the IPI misses it. The new CPU is then in the middle of
scheduling in t: it hasn't yet passed through perf_event_context_sched_in(),
but when it does, it _must_ observe the ctx value we stored.
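Concretely, the sched-in side I'm relying on is roughly the below (a
simplified sketch of the __perf_event_task_sched_in() path; function name
made up, cpuctx locking and other details trimmed):

    /* Simplified sketch, not the exact kernel code. */
    static void sched_in_sketch(struct task_struct *task)
    {
            struct perf_event_context *ctx;
            int ctxn;

            for_each_task_context_nr(ctxn) {
                    /* This load must observe our perf_event_ctxp[] store. */
                    ctx = task->perf_event_ctxp[ctxn];
                    if (!ctx)
                            continue;  /* NULL here means the new event never gets programmed */

                    /* Taking ctx->lock serializes against __perf_install_in_context(). */
                    raw_spin_lock(&ctx->lock);
                    /* ... schedule in the events found on ctx, including the new one ... */
                    raw_spin_unlock(&ctx->lock);
            }
    }

If that load can still return NULL after the IPI failed to find the task,
the event is lost.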
My thinking was that the IPI by itself is not sufficient: when it misses
the task, nothing then guarantees the task's next context switch will see
the store. However, if we order the store against the task_cpu() load,
then any context switch/migration involved in changing that value should
ensure the subsequent sched-in sees our prior store.
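Stripped to the bare ordering, I think this is just the store-buffering
pattern. As a (hypothetical, herd-style) litmus test, with P1 standing in
for the migration plus subsequent sched-in, and its smp_mb() standing in
for whatever ordering the context switch provides (which I'm assuming
amounts to a full barrier):

    C perf-install-vs-sched-in

    {}

    P0(int *ctxp, int *cpu)
    {
            int r0;

            WRITE_ONCE(*ctxp, 1);           /* t->perf_event_ctxp[] = ctx */
            smp_mb();
            r0 = READ_ONCE(*cpu);           /* cpu = task_cpu(t) */
    }

    P1(int *ctxp, int *cpu)
    {
            int r1;

            WRITE_ONCE(*cpu, 2);            /* migrate(t, 2) */
            smp_mb();                       /* the context switch */
            r1 = READ_ONCE(*ctxp);          /* sched-in load of ->perf_event_ctxp[] */
    }

    exists (0:r0=0 /\ 1:r1=0)

The exists clause is the bad case: we read the stale cpu (so the IPI goes
to the old CPU and misses), _and_ the sched-in reads NULL. With smp_mb()
on both sides that outcome is forbidden; without the smp_mb() on P0 it is
allowed, which is exactly the lost-event scenario.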
Of course, even now writing this, I'm still confused.