Need help on oops!

WANG,YIDING (yiding_wang@am.exch.hp.com)
Mon, 30 Aug 1999 20:44:08 -0600


I am testing my driver for HP Fibre Channel HBA on RedHat Linux 6.0 (Linux
2.2.5-15) and encoutered a problem that I haven't been able to figure out
for a week. Really hope someone can give me some help here.

The driver is currently running multi IO test (5 different file size) for 2
-3 hours, some times even longer. The test is simple copy those five files
from IDE drive to first FC driver, then to second FC drive, then copy back
to ide drive and compare the result. It is a loop test going forever or
until being killed. However, system auto reboot after a period of time
(most likely 3 - 4 hours). Occationally it stopped at Oops!

After system reboot, by checking the test log file, there is no data
corruption. Since system reboot without any warning and sync, the error
message at the time of system decides to reboot was not logged in
/var/log/messages. There is no trace can be found related to the error
that causes system reboot.

I have been used kdb for debugging this problem. Unfortunately, it doesn't
always give me good clues. Sometimes the system hung due to kdb itself.
But I do caught twice with kdb to stop when kernel touched NULL pointer. At
that moment, checking all registers, trace and process, I only found
following functions are involved:
Out put:
Enter kdb due to panic @0xc01084e5 ASSED:88
eax: 0x00003400 ebx: 0xc025c000 ecx: 0xc1df2000 edx: 0xc025c000
esi: 0xc025c000 edi: 0xfffffc18 esp: 0xc1df2000 eip: 0xc01084e5
ebp: 0xc025ff58 ss: 0xc025c000 cs: 0x00000010 eflag: 0x00010282
ds: 0x00000018 es: 0x00000018 origeax: 0xffffffff

with stack trace command, found:
0xc01084e5 __switch_to + 0x31
0xc025c000 init_task_union
0xc1df2000 _end + 0x1b58178

I don't know how to figure out from above information that what is really
causing the panic.

I reload Linux without kdb again today and did some more test. During the
same IO test today, I killed test and running command "ps -e". Surprisely I
got similar panic as before. After all test stopped, I ran "ps -e" twice
and both time system gives "Oops". Following are the output of those Opps:

First time running "ps -e" during the test:
Aug 30 17:34:42 localhost kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000009
Aug 30 17:34:42 localhost kernel: current->tss.cr3 = 01208000, %cr3 =
01208000
Aug 30 17:34:42 localhost kernel: *pde = 00000000
Aug 30 17:34:42 localhost kernel: Oops: 0000
Aug 30 17:34:42 localhost kernel: CPU: 0
Aug 30 17:34:43 localhost kernel: EIP: 0010:[get_status+282/972]
Aug 30 17:34:43 localhost kernel: EFLAGS: 00010206
Aug 30 17:34:43 localhost kernel: eax: c10ee008 ebx: c10ee009 ecx:
00000006 edx: 00000005
Aug 30 17:34:43 localhost kernel: esi: c01d5b43 edi: c10ee008 ebp:
c148e000 esp: c1f1ff18
Aug 30 17:34:43 localhost kernel: ds: 0018 es: 0018 ss: 0018
Aug 30 17:34:43 localhost kernel: Process ps (pid: 4431, process nr: 48,
stackpage=c1f1f000)
Aug 30 17:34:43 localhost kernel: Stack: c0201b80 000001ff 0000000d c10ee008
c18cf280 c18cf280 c10ee010 c012b263
Aug 30 17:34:43 localhost kernel: c18cf380 c18cf280 c10ee000 c12d9ec0
00000246 c12d9ec0 ffffffea c014407b
Aug 30 17:34:43 localhost kernel: 0000032d c10ee000 c0144189 c10ee000
0000032d 00000003 c12d9ec0 ffffffea
Aug 30 17:34:43 localhost kernel: Call Trace: [lookup_dentry+359/496]
[get_process_array+35/96] [array_read+209/484] [filp_open+68/240]
[sys_read+174/196] [system_call+52/56]
Aug 30 17:34:43 localhost kernel: Code: 8b 52 04 eb 03 90 31 d2 52 31 c0 66
8b 85 10 01 00 00 50 31

Second time running "ps -e" after test is stopped:
Aug 30 18:12:29 localhost kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000009
Aug 30 18:12:29 localhost kernel: current->tss.cr3 = 01f20000, %cr3 =
01f20000
Aug 30 18:12:29 localhost kernel: *pde = 00000000
Aug 30 18:12:29 localhost kernel: Oops: 0000
Aug 30 18:12:29 localhost kernel: CPU: 0
Aug 30 18:12:29 localhost kernel: EIP: 0010:[get_status+282/972]
Aug 30 18:12:29 localhost kernel: EFLAGS: 00010206
Aug 30 18:12:29 localhost kernel: eax: c12f1008 ebx: c12f1009 ecx:
00000006 edx: 00000005
Aug 30 18:12:29 localhost kernel: esi: c01d5b43 edi: c12f1008 ebp:
c148e000 esp: c137bf18
Aug 30 18:12:29 localhost kernel: ds: 0018 es: 0018 ss: 0018
Aug 30 18:12:29 localhost kernel: Process ps (pid: 5454, process nr: 54,
stackpage=c137b000)
Aug 30 18:12:29 localhost kernel: Stack: c0201b80 000001ff 0000000d c12f1008
c18cf280 c18cf280 c12f1010 c012b263
Aug 30 18:12:29 localhost kernel: c18cf380 c18cf280 c12f1000 c0a207e0
00000246 c0a207e0 ffffffea c014407b
Aug 30 18:12:29 localhost kernel: 0000032d c12f1000 c0144189 c12f1000
0000032d 00000003 c0a207e0 ffffffea
Aug 30 18:12:29 localhost kernel: Call Trace: [lookup_dentry+359/496]
[get_process_array+35/96] [array_read+209/484] [filp_open+68/240]
[sys_read+174/196] [system_call+52/56]
Aug 30 18:12:29 localhost kernel: Code: 8b 52 04 eb 03 90 31 d2 52 31 c0 66
8b 85 10 01 00 00 50 31

Third time running "ps -e":
Aug 30 18:13:42 localhost kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000009
Aug 30 18:13:42 localhost kernel: current->tss.cr3 = 01f20000, %cr3 =
01f20000
Aug 30 18:13:42 localhost kernel: *pde = 00000000
Aug 30 18:13:42 localhost kernel: Oops: 0000
Aug 30 18:13:42 localhost kernel: CPU: 0
Aug 30 18:13:42 localhost kernel: EIP: 0010:[get_status+282/972]
Aug 30 18:13:42 localhost kernel: EFLAGS: 00010206
Aug 30 18:13:42 localhost kernel: eax: c0537008 ebx: c0537009 ecx:
00000006 edx: 00000005
Aug 30 18:13:42 localhost kernel: esi: c01d5b43 edi: c0537008 ebp:
c148e000 esp: c1f1ff18
Aug 30 18:13:42 localhost kernel: ds: 0018 es: 0018 ss: 0018
Aug 30 18:13:42 localhost kernel: Process ps (pid: 9696, process nr: 54,
stackpage=c1f1f000)
Aug 30 18:13:42 localhost kernel: Stack: c0201b80 000001ff 0000000d c0537008
c18cf280 c18cf280 c0537010 c012b263
Aug 30 18:13:42 localhost kernel: c18cf380 c18cf280 c0537000 c12d9860
00000246 c12d9860 ffffffea c014407b
Aug 30 18:13:42 localhost kernel: 0000032d c0537000 c0144189 c0537000
0000032d 00000003 c12d9860 ffffffea
Aug 30 18:13:42 localhost kernel: Call Trace: [lookup_dentry+359/496]
[get_process_array+35/96] [array_read+209/484] [filp_open+68/240]
[sys_read+174/196] [system_call+52/56]
Aug 30 18:13:42 localhost kernel: Code: 8b 52 04 eb 03 90 31 d2 52 31 c0 66
8b 85 10 01 00 00 50 31

The system is still alive at this point.

Question:

Is the problem from kernel or from my driver?

How do I debug based on the Oops output?

Many thanks!

-eddie

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/