Oops and painful death of box, possibly solved

Rick Franchuk (rickf@transpect.net)
Wed, 31 Mar 1999 19:37:09 -0800 (PST)


Recently, I had a contractor of mine install five Intel boxes (PII-400s and
PII-450s) at a provider in San Jose. Although all the pieces in all the
machines were identical, two started producing the following oops under what
appeared to be moderate to heavy disk usage:

Unable to handle kernel NULL pointer dereference at virtual address 0000000b
current->tss.cr3 = 012c7000, pr3 = 012c7000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c012d075>]
EFLAGS: 00010292
eax: 00001960 ebx: fffffff3 ecx: 49913b2c edx: 49913fb4
esi: c020d394 edi: 00000001 ebp: 0000000b esp: c54c5f38
ds: 0018 es: 0018 ss: 0018
Process httpd (pid: 17592, process nr: 58, stackpage=c54c5000)
Stack: 00000001 c2355c00 c020d394 c301301d 874e0363 0000000e c01288b4 c2355c00
c54c5f80 c54c5f80 c0128ae0 c2355c00 c54c5f80 c3013000 c3013000 00000001
bffffbd0 c3013000 c301301d 0000000e 874e0363 c0128bc5 c3013000 00000000
Call Trace: [<c01288b4>] [<c0128ae0>] [<c0128bc5>] [<c0126caf>] [<c0107a40>]
Code: 8b 6d 00 8b 74 24 18 39 73 48 75 eb 8b 74 24 24 39 73 0c 75

>>EIP: c012d075 <d_lookup+65/dc>
Trace: c01288b4 <cached_lookup+10/4c>
Trace: c0128ae0 <lookup_dentry+fc/1b8>
Trace: c0128bc5 <__namei+29/5c>
Trace: c0126caf <sys_newstat+13/64>
Trace: c0107a40 <system_call+34/38>
Code: c012d075 <d_lookup+65/dc> 00000000 <_EIP>: <===
Code: c012d075 <d_lookup+65/dc> 0: 8b 6d 00 movl 0x0(%ebp),%ebp <===
Code: c012d078 <d_lookup+68/dc> 3: 8b 74 24 18 movl 0x18(%esp,1),%esi
Code: c012d07c <d_lookup+6c/dc> 7: 39 73 48 cmpl %esi,0x48(%ebx)
Code: c012d07f <d_lookup+6f/dc> a: 75 eb jne c012d06c <d_lookup+5c/dc>
Code: c012d081 <d_lookup+71/dc> c: 8b 74 24 24 movl 0x24(%esp,1),%esi
Code: c012d085 <d_lookup+75/dc> 10: 39 73 0c cmpl %esi,0xc(%ebx)
Code: c012d088 <d_lookup+78/dc> 13: 75 00 jne c012d08a <d_lookup+7a/dc>
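
For anyone reading the decode above, the arithmetic behind the
<symbol+offset/size> notation can be sketched as below. This is only an
illustration: struct ksym_ref and the three helper functions are hypothetical
names made up for this note, not kernel or ksymoops code. The one thing the
trace itself shows is that the faulting instruction is movl 0x0(%ebp),%ebp
with ebp = 0000000b, which matches the reported fault address 0000000b. That
suggests a bad pointer in ebp, but it says nothing about how the pointer got
that value.

```c
#include <assert.h>
#include <stdint.h>

/* One decoded location of the form "addr <symbol+offset/size>".
 * Hypothetical helper type for illustration only. */
struct ksym_ref {
    uint32_t addr;   /* absolute address, e.g. 0xc012d075 */
    uint32_t offset; /* bytes into the function, e.g. 0x65 */
    uint32_t size;   /* total size of the function, e.g. 0xdc */
};

/* Address of the first byte of the function containing the reference. */
uint32_t ksym_start(struct ksym_ref r)
{
    assert(r.offset < r.size); /* the offset must fall inside the function */
    return r.addr - r.offset;
}

/* Address one past the last byte of the function. */
uint32_t ksym_end(struct ksym_ref r)
{
    return ksym_start(r) + r.size;
}

/* Effective address read by an instruction of the form "movl disp(%base),reg". */
uint32_t ea_base_disp(uint32_t base, int32_t disp)
{
    return base + (uint32_t)disp;
}
```

Feeding in the EIP line (addr 0xc012d075, offset 0x65, size 0xdc) places
d_lookup at c012d010 through c012d0ec, and ea_base_disp(0x0000000b, 0) gives
the same 0000000b that the first line of the oops reports.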

A number of oopses would happen in rapid succession, followed by segfaults of
whatever happened to be running and 'cannot fork()' messages streaming down
the screen locally (I never saw them though... I'm in Vancouver, so I can't
detail exactly what was on the screen if it wasn't logged).

Curiously, the machine also exhibited the following during boot up (which was
annoying, because the 'timeouts' involved were fairly long):

hda: no response (status = 0xa1), resetting drive
hda: no response (status = 0xa1)
hdb: no response (status = 0xa1), resetting drive
hdb: no response (status = 0xa1)
hdc: no response (status = 0xa1), resetting drive
hdc: no response (status = 0xa1)
hdd: no response (status = 0xa1), resetting drive
hdd: no response (status = 0xa1)
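
As a rough aid for reading the status byte, here is a small sketch that names
the bits set in 0xa1. The bit names and masks are the ones I remember from
include/linux/hdreg.h, so treat them as an assumption and check them against
your own tree; ide_status_names and the table are hypothetical names for this
note, not driver code. Note also that when the busy bit is set the other bits
are not supposed to be meaningful, and with no drives attached the value may
simply be whatever the idle bus reads back.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct status_bit {
    unsigned char mask;
    const char *name;
};

/* Status register bits, highest bit first (names as recalled from hdreg.h). */
static const struct status_bit status_bits[] = {
    { 0x80, "BUSY_STAT"  },
    { 0x40, "READY_STAT" },
    { 0x20, "WRERR_STAT" },
    { 0x10, "SEEK_STAT"  },
    { 0x08, "DRQ_STAT"   },
    { 0x04, "ECC_STAT"   },
    { 0x02, "INDEX_STAT" },
    { 0x01, "ERR_STAT"   },
};

/* Write the space-separated names of the bits set in `status` into `buf`.
 * Stops early if `buf` is too small (the last name may then be cut short).
 * Returns `buf`. */
char *ide_status_names(unsigned char status, char *buf, size_t len)
{
    size_t i, used = 0;

    if (len == 0)
        return buf;
    buf[0] = '\0';

    for (i = 0; i < sizeof status_bits / sizeof status_bits[0]; i++) {
        int n;

        if (!(status & status_bits[i].mask))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     used ? " " : "", status_bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* out of room */
        used += (size_t)n;
    }
    return buf;
}
```

Calling ide_status_names(0xa1, buf, sizeof buf) with a 64-byte buffer leaves
"BUSY_STAT WRERR_STAT ERR_STAT" in buf, since 0xa1 = 0x80 + 0x20 + 0x01.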

I have a feeling that this is significant: once I was able to get our man
in Cali to completely disable all onboard IDE controllers, the oopses SEEM
to have disappeared entirely. (We run 100% SCSI on Adaptec 2940UWs, and the
oopses also flared up on an NCR53c875 we decided to test.) I'm writing in
hopes that someone can confirm that this is indeed the source of the error
(to let me sleep sounder at night). If it's a specific board-related issue,
I can find out the model number so you all can avoid it. ;)

--
  __________________________________________
 |                                          |
 |  Rick Franchuk  -  TranSpecT Consulting  |
 |_______                            _______|
         \mailto:rickf@transpect.net/
          \_____ICQ_#_4435025______/
