Re: Perf hotplug lockup in v4.9-rc8

From: Mark Rutland
Date: Wed Dec 07 2016 - 14:57:37 EST


On Wed, Dec 07, 2016 at 07:34:55PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 07, 2016 at 05:53:47PM +0000, Mark Rutland wrote:
> > On Wed, Dec 07, 2016 at 01:52:17PM +0000, Mark Rutland wrote:
> > > Hi all
> > >
> > > Jeremy noticed a kernel lockup on arm64 when the perf tool was used in
> > > parallel with hotplug, which I've reproduced on arm64 and x86(-64) with
> > > v4.9-rc8. In both cases I'm using defconfig; I've tried enabling lockdep
> > > but it was silent for arm64 and x86.
> >
> > It looks like we're trying to install a task-bound event into a context
> > where task_cpu(ctx->task) is dead, and thus the cpu_function_call() in
> > perf_install_in_context() fails. We retry repeatedly.
> >
> > On !PREEMPT (as with x86 defconfig), we manage to prevent the hotplug
> > machinery from making progress, and this turns into a livelock.
> >
> > On PREEMPT (as with arm64 defconfig), I'm somewhat lost.
>
> So the problem is that even with PREEMPT we can hit a blocked task
> that has a 'dead' cpu.
>
> We'll spin until either the task wakes up or the CPU does, either can
> take a very long time.
>
> How exactly your test-case triggers this, all it executes is 'true' and
> that really shouldn't block much, is a mystery still.

The perf tool forks a helper process, which blocks on a pipe and, once
signalled, execs the target (i.e. true). The main perf process opens
enable-on-exec events on the helper, then writes to the pipe to wake it
up.

... so now I see why that makes us see a dead task_cpu(); thanks for the
explanation above!

[...]

> @@ -2352,6 +2357,28 @@ perf_install_in_context(struct perf_event_context *ctx,
>  		return;
>  	}
>  	raw_spin_unlock_irq(&ctx->lock);
> +
> +	raw_spin_lock_irq(&task->pi_lock);
> +	if (!(task->state == TASK_RUNNING || task->state == TASK_WAKING)) {

For a moment I thought there was a remaining race here with the lazy
ctx-switch if the new task was RUNNING on an online CPU, but I guess
we'll retry the cpu_function_call() in that case.

I'll attack this tomorrow when I can think again...

Thanks,
Mark.