Re: Fw: Re: oops in choose_configuration()

From: Linus Torvalds
Date: Sun Mar 05 2006 - 23:58:17 EST




On Sun, 5 Mar 2006, Andrew Morton wrote:
>
> For several days I've been getting repeatable oopses in the -mm kernel.
> They occur once per ~30 boots, during initscripts.

Actually, having thought about this some more, I wonder if the bug isn't a
hell of a lot simpler than we've given it credit for.

I think you're running with CONFIG_PREEMPT_VOLUNTARY, right?

And looking more closely, that thing is BROKEN. DaveJ - do Fedora kernels
also enable that thing?

Ingo: as far as I can see, CONFIG_PREEMPT_VOLUNTARY is totally and utterly
broken during bootup. It does:

# define might_resched() cond_resched()

and then we have

# define might_sleep() do { might_resched(); } while (0)

but the fact is, we _know_ that "might_sleep()" is broken during early
bootup. We know this, because when we have __might_sleep() enabled to
warn about cases where we must not sleep, we've had those tests disabled
during early boot for a long time, in order to avoid irritating and nasty
known "sleeping function called from invalid context" messages:

	...
	if ((in_atomic() || irqs_disabled()) &&
	    system_state == SYSTEM_RUNNING && !oops_in_progress) {
		if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
	...

Note in particular the "system_state == SYSTEM_RUNNING". It's there for a
reason. Namely that we know that we do things that aren't valid during
early bootup, and that we call functions that might sleep while we have
interrupts disabled, for example.

HOWEVER, the "cond_resched()" does not take that into account at all, and
will happily conditionally reschedule things at early bootup before we
have set system_state to SYSTEM_RUNNING.
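
To make the mismatch concrete, here is a rough userspace sketch, not kernel
code. Every name in it (the model_* functions, simulate_early_boot(), the
plain bool flags standing in for irqs_disabled() and need_resched()) is
made up for illustration, and "scheduling" is just a counter. It only shows
that the debug check is muted before SYSTEM_RUNNING while the reschedule
itself is not, and what a system_state guard would change:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for kernel state. All of this is a model. */
enum system_states { SYSTEM_BOOTING, SYSTEM_RUNNING };

static enum system_states system_state;
static bool irqs_off;		/* stands in for irqs_disabled() */
static bool resched_pending;	/* stands in for need_resched() */
static int warnings;		/* times the debug check fired */
static int schedules;		/* times we "entered the scheduler" */

struct boot_result {
	int warnings;
	int schedules;
};

/* Debug check: stays silent unless system_state == SYSTEM_RUNNING. */
static void model_sleep_check(void)
{
	if (irqs_off && system_state == SYSTEM_RUNNING)
		warnings++;
}

/* cond_resched() as described above: no system_state test at all. */
static void model_cond_resched(void)
{
	if (resched_pending) {
		resched_pending = false;
		schedules++;
	}
}

/* Hypothetical guarded variant: refuse to schedule before SYSTEM_RUNNING. */
static void model_cond_resched_guarded(void)
{
	if (system_state != SYSTEM_RUNNING)
		return;
	model_cond_resched();
}

/* might_sleep() under voluntary preemption: check, then maybe reschedule. */
static void model_might_sleep(bool guarded)
{
	model_sleep_check();
	if (guarded)
		model_cond_resched_guarded();
	else
		model_cond_resched();
}

/* One might_sleep() call in early boot, irqs off, reschedule pending. */
static struct boot_result simulate_early_boot(bool guarded)
{
	struct boot_result res;

	system_state = SYSTEM_BOOTING;
	irqs_off = true;
	resched_pending = true;
	warnings = 0;
	schedules = 0;
	assert(system_state != SYSTEM_RUNNING);

	model_might_sleep(guarded);

	res.warnings = warnings;
	res.schedules = schedules;
	return res;
}
```

In this model, simulate_early_boot(false) reports no warning but one trip
through the "scheduler", which is exactly the silent case. With
simulate_early_boot(true) neither happens.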

In other words, unless I've totally lost it, I think that
CONFIG_PREEMPT_VOLUNTARY currently makes us re-schedule at points in the
early boot that we _know_ are unsafe. We happen to not hit it very often,
because (a) some of the time it doesn't matter and (b) when it matters, we
seldom have "need_resched()" returning true, but I would not be at all
surprised if Andrew's problems are because the scheduler heuristics make
it happen when it shouldn't.

And the end result? I don't know. But we've traditionally run _all_ of the
early boot ignoring the "might_sleep()" warnings, up until the point where
we unlock the kernel lock, long after things like kmem_cache_init().

So I would not be surprised, for example, if we had kmem_cache_init()
doing bad things because it got interrupts enabled at a point where it
shouldn't, because it went through the scheduler.

I dunno. I can't actually see what would corrupt anything, but the point
is that we definitely do scheduling in places that have gotten absolutely
_zero_ coverage, because we turned off the checks on purpose during early
boot because the checks gave false positives.

And CONFIG_PREEMPT_VOLUNTARY turns those false positives into potential
rescheduling events.

Maybe I'm crazy. But it looks really really broken to me.

Andrew, if I'm right, then this ugly patch should make a difference.

Is there something else I've missed?

Linus

----
diff --git a/kernel/sched.c b/kernel/sched.c
index 12d291b..3454bb8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4028,6 +4028,8 @@ static inline void __cond_resched(void)
 	 */
 	if (unlikely(preempt_count()))
 		return;
+	if (unlikely(system_state != SYSTEM_RUNNING))
+		return;
 	do {
 		add_preempt_count(PREEMPT_ACTIVE);
 		schedule();