Re: Perf Oops on 3.14-rc2

From: Peter Zijlstra
Date: Wed Feb 19 2014 - 15:24:57 EST


On Wed, Feb 19, 2014 at 08:59:08PM +0100, Stephane Eranian wrote:
> On Wed, Feb 19, 2014 at 7:36 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > On Wed, Feb 19, 2014 at 07:03:13PM +0100, Stephane Eranian wrote:
> >> I am trying to understand the context here.
> >> Are you saying, we may call an offline CPU?
> >
> > Yes, that is what's happening.
> >
> >> I saw that sometimes you retry, sometimes you don't.
> >
> > I tried to do exactly what we do for the task case which is far more
> > likely to fail. Could be I messed up.
> >
> I am not sure why you need to retry. If the CPU is offline, it is offline.
> Or are you saying, you get an error, but you don't know the exact
> reason, thus you keep trying? But how do you get out of this if
> the CPU stays offline?

Ah, so take perf_remove_from_context() as it was before the patch: if
cpu_function_call() fails because the CPU is offline, it doesn't call
list_del_event().

Now the offline function is supposed to take them off the list, but it
doesn't actually do so when they're grouped.

This leaves a free()d event on the offline cpu's context list.

After that things quickly go downhill.

But before I got there I was led down a few too many rabbit holes trying
to figure out wtf happened.


We could probably fix it differently though. But by the time I more or
less understood things I was too tired to make something pretty.

Anyway; if you want to do something when cpu_function_call() fails, you
also have to check whether the CPU came back up since you tried; at which
point you've got the same pattern as we have for task_function_call().
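
Roughly, as a userspace toy and not the actual patch: every toy_* name
below is made up for illustration, the ctx lock is only marked in
comments, and pending_onlines is just a knob to simulate the CPU coming
back up between the failed call and the recheck. It also shows how you
get out when the CPU stays offline: the recheck fails and you unlink the
event yourself.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these types exist in the kernel. */
struct toy_cpu {
	bool online;		/* stands in for the CPU being online */
	int nr_events;		/* length of this CPU's context list */
	int pending_onlines;	/* test knob: times the CPU comes back up */
};

struct toy_event {
	struct toy_cpu *cpu;
	bool on_list;		/* still linked on the context list? */
};

/* Stand-in for list_del_event(): unlink the event from the list. */
static void toy_list_del_event(struct toy_event *ev)
{
	if (ev->on_list) {
		ev->on_list = false;
		ev->cpu->nr_events--;
	}
}

/* Stand-in for cpu_function_call(): fails when the CPU is offline. */
static int toy_cpu_function_call(struct toy_event *ev)
{
	if (!ev->cpu->online)
		return -1;
	toy_list_del_event(ev);	/* would run on the target CPU */
	return 0;
}

/* Recheck after a failed call; the CPU may have come up meanwhile. */
static bool toy_recheck_online(struct toy_cpu *cpu)
{
	if (!cpu->online && cpu->pending_onlines > 0) {
		cpu->pending_onlines--;
		cpu->online = true;
	}
	return cpu->online;
}

/* Old behaviour as described above: a failed call is ignored. */
static void toy_remove_buggy(struct toy_event *ev)
{
	(void)toy_cpu_function_call(ev);
}

/* Retry pattern; returns the number of cross-call attempts made. */
static int toy_remove_retry(struct toy_event *ev)
{
	int attempts = 0;

	for (;;) {
		attempts++;
		if (toy_cpu_function_call(ev) == 0)
			break;
		/* the real thing would take the ctx lock here */
		if (toy_recheck_online(ev->cpu))
			continue;	/* came back up since we tried */
		toy_list_del_event(ev);	/* still offline: unlink directly */
		break;
	}
	assert(!ev->on_list);
	return attempts;
}
```

With the CPU offline, toy_remove_buggy() leaves the event linked, which
is the stale entry described above; toy_remove_retry() either unlinks it
directly (CPU stays offline, one attempt) or retries the call once the
CPU is back.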