Network-related Oopses on 2.2.13pre14

Simon Kirby (sim@netnation.com)
Sun, 3 Oct 1999 08:47:47 -0700


This is the dump over a serial console from one of our webservers that,
in the last few days, has Oopsed similarly with traces related to
networking (but not exactly the same as this one, I don't think).
Previously, I think I saw something before in tcp_v4_unhash(), but I'm
not sure. This is the first Oops that we've managed to record (over the
serial console), and so perhaps somebody can make something of it.

Unable to handle kernel NULL pointer dereference at virtual address 00000004
current->tss.cr3 = 00101000, %cr3 = 00101000
*pde = 00000000
Oops: 0002
CPU: 1
EIP: 0010:[<c015ef26>]
EFLAGS: 00010046
eax: c913ff04 ebx: 00000297 ecx: c913ffc0 edx: 00000000
esi: c913f3c0 edi: c0223ffc ebp: c0230008 esp: c0231f84
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, process nr: 0, stackpage=c0231000)
Stack: 00000001 c0231fa8 00c82be9 c011b1f5 00001e17 0009e000 c0106000 c010b8ad
c0230000 00000e00 c010a3c8 c0230000 c9b4c000 c0230000 0009e000 c0106000
00000e00 00000080 00000018 ffff0018 ffffff0a c0107a1b 00000010 00000246
Call Trace: [<c011b1f5>] [<c0106000>] [<c010b8ad>] [<c010a3c8>] [<c0106000>] [<c0107a1b>] [<c0106000>]
[<c01001b1>]
Code: c7 42 04 74 8b 25 c0 89 15 74 8b 25 c0 c7 00 00 00 00 00 c7
Aiee, killing interrupt handler
>>EIP: c015ef26 <net_bh+72/1e8>
Trace: c011b1f5 <do_bottom_half+85/a8>
Trace: c0106000 <get_options>
Trace: c010b8ad <do_IRQ+4d/54>
Trace: c010a3c8 <common_interrupt+18/20>
Trace: c0106000 <get_options>
Trace: c0107a1b <cpu_idle+37/4c>
Trace: c0106000 <get_options>
Trace: c01001b1 <L6>
Code: c015ef26 <net_bh+72/1e8>
Code: c015ef26 <net_bh+72/1e8> c7 42 04 74 8b movl $0xc0258b74,0x4(%edx)
Code: c015ef2d <net_bh+79/1e8> 89 15 74 8b 25 movl %edx,0xc0258b74
Code: c015ef39 <net_bh+85/1e8> c7 00 00 00 00 movl $0x0,(%eax)

Aiee, killing interrupt handler
Kernel panic: Attempted to kill the idle task!
In swapper task - not syncing

The "get_options" in the backtrace is odd, and makes me wonder if I
should recompile with fno-omit-frame-pointer. I was not able to see what
is happening here. I have attached the .config file. Also, here is a
disassembly dump of the first part of the net_bh function, if it helps
any:

c015eeb4 <net_bh>:
c015eeb4: 83 ec 04 sub $0x4,%esp
c015eeb7: 55 push %ebp
c015eeb8: 57 push %edi
c015eeb9: 56 push %esi
c015eeba: 53 push %ebx
c015eebb: 8b 0d 88 4a 21 c0 mov 0xc0214a88,%ecx
c015eec1: 89 4c 24 10 mov %ecx,0x10(%esp,1)
c015eec5: 81 3d 24 37 22 c0 24 cmpl $0xc0223724,0xc0223724
c015eecc: 37 22 c0
c015eecf: 74 07 je c015eed8 <net_bh+0x24>
c015eed1: e8 76 47 00 00 call c016364c <qdisc_run_queues>
c015eed6: 89 f6 mov %esi,%esi
c015eed8: 81 3d 74 8b 25 c0 74 cmpl $0xc0258b74,0xc0258b74
c015eedf: 8b 25 c0
c015eee2: 0f 84 78 01 00 00 je c015f060 <net_bh+0x1ac>
c015eee8: a1 88 4a 21 c0 mov 0xc0214a88,%eax
c015eeed: 2b 44 24 10 sub 0x10(%esp,1),%eax
c015eef1: 83 f8 01 cmp $0x1,%eax
c015eef4: 0f 87 8a 01 00 00 ja c015f084 <net_bh+0x1d0>
c015eefa: 9c pushf
c015eefb: 5b pop %ebx
c015eefc: fa cli
c015eefd: f0 0f ba 2d 54 2f 22 lock btsl $0x0,0xc0222f54
c015ef04: c0 00
c015ef06: 0f 82 d6 47 08 00 jb c01e36e2 <stext_lock+0x2112>
c015ef0c: 8b 15 74 8b 25 c0 mov 0xc0258b74,%edx
c015ef12: 31 c0 xor %eax,%eax
c015ef14: 81 fa 74 8b 25 c0 cmp $0xc0258b74,%edx
c015ef1a: 74 2b je c015ef47 <net_bh+0x93>
c015ef1c: 89 d0 mov %edx,%eax
c015ef1e: 8b 10 mov (%eax),%edx
c015ef20: ff 0d 7c 8b 25 c0 decl 0xc0258b7c
c015ef26: c7 42 04 74 8b 25 c0 movl $0xc0258b74,0x4(%edx)
c015ef2d: 89 15 74 8b 25 c0 mov %edx,0xc0258b74
c015ef33: c7 00 00 00 00 00 movl $0x0,(%eax)
c015ef39: c7 40 04 00 00 00 00 movl $0x0,0x4(%eax)
c015ef40: c7 40 08 00 00 00 00 movl $0x0,0x8(%eax)
c015ef47: f0 0f ba 35 54 2f 22 lock btrl $0x0,0xc0222f54
c015ef4e: c0 00
c015ef50: 53 push %ebx
c015ef51: 9d popf
c015ef52: 89 c6 mov %eax,%esi
c015ef54: 8b 86 80 00 00 00 mov 0x80(%esi),%eax
c015ef5a: 89 46 20 mov %eax,0x20(%esi)
c015ef5d: 89 46 1c mov %eax,0x1c(%esi)
c015ef60: 8b 46 24 mov 0x24(%esi),%eax
...

This is a dual PIII-450 box running on an ASUS P2B-DS board with 256 MB
of ECC SDRAM, and I haven't seen any NMI parity errors reported, so I'm
assuming the RAM is probably fine. These oopses only recently started
happening, and happened first on kernel 2.2.10ac5 (which prompted me to
upgrade to 2.2.13pre14). We have about 8 other machines with the exact
same configuration and the older 2.2.10ac5 kernel that have never Oopsed
in the same way, making me wonder if it's some odd hardware problem. All
kernels compiled with GCC 2.7.2.3 and binutils 2.9.1.0.15.

The systems are SCSI-based with the aic7xxx controller (no IDE). All
have eepro100 NICs. No IRQs shared. Ingo's NMI oopser patch applied,
but only recently (to see if it would help track this down any) --
Oopsing occurred before this. There are no signs of it "choking", it
just seems to blow up immediately each and every time.

When the last oops which occurred (with a different kernel, though,
unfortunately), we took a screenshot (literally). I think I still have
the System.map for this, but I'm not sure. Here is the screenshot:

http://blue.netnation.com/mvc-005x.jpg

Here is the "System.old" file, which may or may not apply to this Oops:

http://blue.netnation.com/System.old

Any help would be appreciated. :) I will recompile the kernel with
-fno-omit-frame-pointer as I'm fairly confident that it will crash again
today or tomorrow.

Simon-

| Simon Kirby | Systems Administration |
| mailto:sim@netnation.com | NetNation Communications |
| http://www.netnation.com/ | Tech: (604) 684-6892 |

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/