Re: Intermittent early panic in try_to_wake_up

From: Con Kolivas
Date: Sun Nov 08 2009 - 03:29:31 EST


On Sun, 8 Nov 2009 03:35:54 Peter Zijlstra wrote:
> On Sat, 2009-11-07 at 12:24 -0400, Kevin Winchester wrote:
> > Mike Galbraith wrote:
> > > On Fri, 2009-11-06 at 19:49 -0400, Kevin Winchester wrote:
> > >> The patch below does not apply to mainline, unless I'm doing something
> > >> wrong. It's against -tip, I assume? Is it just as applicable to
> > >> mainline?
> > >
> > > It was mainline, but I had the scheduler pull request and another in
> > > for testing as well. Linus has pulled, so it'll apply now, with
> > > offsets.
> >
> > It did end up applying, but did not have any effect. Looking at the
> > patch again, I see that it appears to only affect CONFIG_SMP, which I am
> > not running (and in fact it adds a build warning for the !SMP case). So
> > there was not much chance of it fixing anything, I suppose.
> >
> > Any other ideas? I don't have a serial console, and the trace scrolls
> > off my console, so I don't know if any debug printks would help. Would
> > it help if I copied down the entire panic message, including the Code
> > section? I can try that the next time it happens.
>
> Use vga=ask boot_delay=100 and select the highest res possible.
>
> Possibly you could use a digital (video) camera to record the output.
>

For what it's worth, I've seen this on BFS and assumed it was a BFS issue until
I spotted this thread. I'll tell you what I discovered when I was investigating
it, though unfortunately I never found the root cause.

Incredibly, the bug happened in try_to_wake_up, where the task struct passed in
by the caller (p) gets dereferenced before the rq lock is grabbed. Then, when
it attempts to grab the rq lock, it has no p left to reference.
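
To make that ordering concrete, here is a toy userspace model of how I read
the failure: the runqueue can only be found through p, so p has to be looked at
before any rq lock exists to protect it. Every name below (model_task_rq,
model_task_rq_lock, the struct layouts, NR_CPUS as 2) is made up for
illustration and is not the kernel's code. The NULL check is also an
assumption of the sketch, added so it fails cleanly instead of crashing.

```c
#include <stddef.h>

#define NR_CPUS 2			/* hypothetical, for the sketch only */

struct rq { int locked; };		/* stand-in for a per-CPU runqueue */
struct task { int cpu; };		/* stand-in for struct task_struct */

static struct rq runqueues[NR_CPUS];

/* The runqueue is found *through* the task, so p is read with no lock held. */
static struct rq *model_task_rq(const struct task *p)
{
	/* Sketch-only sanity check so a missing p is reported, not fatal. */
	if (p == NULL || p->cpu < 0 || p->cpu >= NR_CPUS)
		return NULL;
	return &runqueues[p->cpu];
}

/* Look up the rq via p first, then "lock" it. Returns NULL if p is gone. */
static struct rq *model_task_rq_lock(const struct task *p)
{
	struct rq *rq = model_task_rq(p);

	if (rq != NULL)
		rq->locked = 1;
	return rq;
}
```

With a valid task on CPU 1, model_task_rq_lock returns &runqueues[1] and marks
it locked. With no task there is nothing to look the rq up from, which is the
situation the panic appeared to be in.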

Further investigation showed it always to be ksoftirqd spawning at bootup, and
never any other situation. The one common factor was that a conditional resched
always occurred first, and that's how it would get lost. I tried stepping
through the boot process on kvm but always came up stumped as to how on earth
it even happened. The only common variable was that it -only- ever happened
with voluntary preempt enabled, and not with full preempt or no-preempt.
cond_resched is called 2 or 3 times during the boot sequence via might_sleep by
that stage, but if I removed each might_sleep one at a time it would just
happen from a different might_sleep, suggesting we weren't sleeping when we
shouldn't. Since I'm no fan of voluntary preempt, I gave up trying to find the
root cause and put this nonsense workaround in __cond_resched:

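static void __cond_resched(void)
{
	/* Workaround: skip the voluntary resched until boot has completed. */
	if (unlikely(system_state != SYSTEM_RUNNING))
		return;

	/* ... rest of the function unchanged ... */
}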

It's still there in BFS, and it fixes the problem, in case someone wants to
use voluntary preempt with BFS. I've long since lost the config that caused the
problem reliably and can't guarantee that it's the same thing happening on
mainline, but figured the information might be helpful.
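
For anyone who wants to poke at the idea outside the kernel, here is a rough
userspace sketch of what the guard does. Only system_state and SYSTEM_RUNNING
come from the snippet above. The model_* functions, the need_resched_flag
variable and the two-value enum are stand-ins invented for the example, and
the might_sleep wrapper only mimics the voluntary preempt behaviour described
earlier in this mail.

```c
#include <stdbool.h>

/* Simplified stand-in for the kernel's system_state; two states only. */
enum system_states { SYSTEM_BOOTING, SYSTEM_RUNNING };
static enum system_states system_state = SYSTEM_BOOTING;

static bool need_resched_flag;	/* hypothetical "a resched is pending" flag */
static int resched_count;	/* how many times we actually rescheduled */

/* Hypothetical stub standing in for the real reschedule path. */
static void model_schedule(void)
{
	need_resched_flag = false;
	resched_count++;
}

/* The workaround: refuse to reschedule until boot has finished. */
static bool model_cond_resched(void)
{
	if (system_state != SYSTEM_RUNNING)
		return false;
	if (!need_resched_flag)
		return false;
	model_schedule();
	return true;
}

/* With voluntary preempt, might_sleep doubles as a resched point. */
static bool model_might_sleep(void)
{
	return model_cond_resched();
}
```

While system_state is SYSTEM_BOOTING, model_might_sleep never reschedules even
with a resched pending. Once it is set to SYSTEM_RUNNING, the pending resched
is taken on the next call. That is all the workaround buys: it hides the
problem during boot rather than explaining it.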

Regards,
--
-ck