Re: 2.6.21-rc5-mm4

From: Eric W. Biederman
Date: Wed Apr 04 2007 - 13:23:57 EST


Jiri Kosina <jikos@xxxxxxxx> writes:

> On Tue, 3 Apr 2007, Jiri Kosina wrote:
>
>> > we're also having problems reproducing it on that same combination
>> > (2.6.21-rc4 + my tree), so it points to something in -mm. Since your
>> > trace is completely different right now it looks like something else
>> > is fuzzing it up. Since the e1000 changes are in rc5-mm3 as well, that
>> > might help to narrow it down quickly.
>> I don't know (yet) whether rc5-mm3 was OK in this respect, I didn't boot
>> it on this machine. I only know that both rc5 and rc5 + e1000 tree are
>> OK, but rc5-mm4 panics on ifconfig/dhclient on e1000 card immediately on
>> my system.
>> I will start bisection when I get back to the respective machine
>> (tomorrow) and will let you know.
>
> And the bisection winner is
>
> i386-irq-kill-nr_irq_vectors-and-increase-nr_irqs.patch
>
> I don't immediately see how it could be causing it, so adding CCs which
> are listed in the patch.

Weird. I will have to look at that in a little more detail.

Do you know if this problem happens on x86_64?
What does your .config look like?
What does /proc/interrupts look like?
What kind of hardware you running this kernel on?
Can anyone else reproduce this?

The oops clearly shows something using -1 and calling that as an
address I don't know why, but I'm guessing I have triggered a memory
stomp somewhere. I think this is the first time I have seen a small
negative number causing a NULL pointer dereference.

That patch looks innocuous enough that either:
- I just missed changing something I should have.
- Your configuration has an increase in NR_IRQS and that triggered
something.
- The patch simply permuted things so a memory stomp now happens
on the e1000 data structures instead of somewhere else.
- Something doesn't like large irq numbers.

This work is essentially a backport from x86_64 so if your hardware
is 64bit capable testing that should be a fairly easy test, and be
able to rule out large irq numbers as the culprit.

Until I get a good look at -mm I'm going to have a hard time guessing.
But a roving memory stomp is my best guess.


Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/