Hard lock on boot with both 2.2.16 and 2.4.0-pre7

From: Nathaniel Smith (njs@has-96-90.reshall.berkeley.edu)
Date: Sun Sep 03 2000 - 13:56:11 EST


I'm not quite certain if this is a user-mode, kernel, or hardware problem,
but there is a reproducible hard lock involved (SysRq keys no longer work,
etc.) and I managed to get some Oops output, so maybe someone here can
give me a clue what's going on. As it is, I have no idea what to do next,
and my box is unusable.

Hardware: dual PIII 500, not overclocked, Intel motherboard, IDE, 3com
 ethernet card, ask if you need any more details...
Kernel: originally 2.2.16, then 2.4.0-pre7
Distro: Debian woody

A few days ago, I ran a standard apt-get upgrade, which seemed to complete
without problems. A few minutes later, the machine locked completely.
After SysRq failed to have effect, rebooted the machine manually, and while
starting daemons it froze solid again, repeatedly printing to the screen:
"stuck on TLB IPI wait (CPU#1)"
(it varies whether it says CPU#1 or CPU#0).
It seems that the relevant daemon was mysql; after disabling mysql, it
waited until getting to postgres to show the same behaviour; after disabling
postgres, it seems to get to sshd before crashing. I can start the box in
single user mode and things seem to mostly work fine (I'm typing this email
on the box now, the ethernet is up, and I had no problems compiling a new
kernel version for testing). Booting kernel 2.4.0-pre7 instead, I saw the
same behaviour, but a more detailed error message (may be a typo or two;
I had to copy it out by hand):

NMI Watchdog detected LOCKUP on CPU0, registers:
CPU: 0
EIP: 0010:[<c01cf0e6>]
EFLAGS: 00000086
eax: c1273000 ebx: 00000046 ecx: 00000000 edx: 00000000
esi: 04000001 edi: 00002710 ebp: c0222000 esp: c0223f30
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage = c0223000)
Stack: 000002aa c01729e1 000000f4 00000202 04000001 00000000 0000000c c0274235
       c017sb07 c7a30ac0 c010bde1 0000000c c1273000 c0223fa8 c02745e0 c025c980
       0000000c c0223fa0 c010bfcb 0000000c c0223fa8 c7a30ac0 c01088c0 c0222000
Call Trace: [<c01729e1>] [<c0172b07>] [<c010bde1>] [<c010bfc6>] [<c01088c0>] [<c01088c0>] [<c010a710>]
            [<c01088c0>] [<c01088c0>] [<c0100018>] [<c01088ed>] [<c0108952>] [<c0105000>] [<c01001d0>]
Code: 80 3d 20 aa 21 c0 00 f3 90 7e f5 e9 a2 3b fa ff 80 3d 20 aa
console shuts up ...

Running this through ksymoops (I believe the only modules loaded at the
point of the crash were 3c95x and sr_mod, but it's difficult to tell, and
ksymoops doesn't seem to mind the lack of module information):

ksymoops 2.3.4 on i686 2.2.16. Options used
     -v /usr/src/linux-2.4.0-pre7/vmlinux (specified)
     -K (specified)
     -L (specified)
     -o /lib/modules/2.4.0-test7/ (specified)
     -m /usr/src/linux-2.4.0-pre7/System.map (specified)

No modules in ksyms, skipping objects
NMI Watchdog detected LOCKUP on CPU0, registers:
CPU: 0
EIP: 0010:[<c01cf0e6>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00000086
eax: c1273000 ebx: 00000046 ecx: 00000000 edx: 00000000
esi: 04000001 edi: 00002710 ebp: c0222000 esp: c0223f30
ds: 0018 es: 0018 ss: 0018
Stack: 000002aa c01729e1 000000f4 00000202 04000001 00000000 0000000c c0274235
       c017sb07 c7a30ac0 c010bde1 0000000c c1273000 c0223fa8 c02745e0 c025c980
       0000000c c0223fa0 c010bfcb 0000000c c0223fa8 c7a30ac0 c01088c0 c0222000
Call Trace: [<c01729e1>] [<c0172b07>] [<c010bde1>] [<c010bfc6>] [<c01088c0>] [<c01088c0>] [<c010a710>]
            [<c01088c0>] [<c01088c0>] [<c0100018>] [<c01088ed>] [<c0108952>] [<c0105000>] [<c01001d0>]
Code: 80 3d 20 aa 21 c0 00 f3 90 7e f5 e9 a2 3b fa ff 80 3d 20 aa

>>EIP; c01cf0e6 <stext_lock+3522/7a88> <=====
Trace; c01729e1 <handle_kbd_event+85/18c>
Trace; c0172b07 <keyboard_interrupt+1f/2c>
Trace; c010bde1 <handle_IRQ_event+4d/78>
Trace; c010bfc6 <do_IRQ+a6/f4>
Trace; c01088c0 <default_idle+0/34>
Trace; c01088c0 <default_idle+0/34>
Trace; c010a710 <ret_from_intr+0/20>
Trace; c01088c0 <default_idle+0/34>
Trace; c01088c0 <default_idle+0/34>
Trace; c0100018 <startup_32+18/cc>
Trace; c01088ed <default_idle+2d/34>
Trace; c0108952 <cpu_idle+3e/54>
Trace; c0105000 <empty_bad_page+0/1000>
Trace; c01001d0 <L6+0/2>
Code; c01cf0e6 <stext_lock+3522/7a88>
00000000 <_EIP>:
Code; c01cf0e6 <stext_lock+3522/7a88> <=====
   0: 80 3d 20 aa 21 c0 00 cmpb $0x0,0xc021aa20 <=====
Code; c01cf0ed <stext_lock+3529/7a88>
   7: f3 90 repz nop
Code; c01cf0ef <stext_lock+352b/7a88>
   9: 7e f5 jle 0 <_EIP>
Code; c01cf0f1 <stext_lock+352d/7a88>
   b: e9 a2 3b fa ff jmp fffa3bb2 <_EIP+0xfffa3bb2> c0172c98 <aux_write_ack+4/3c>
Code; c01cf0f6 <stext_lock+3532/7a88>
  10: 80 3d 20 aa 00 00 00 cmpb $0x0,0xaa20

I would appreciate any information at all people can give me about this
problem; as I said, I can't even tell for sure whether it's hardware, kernel,
or userland. The consistency over kernel versions and completeness of the
lock suggest something hardware related, but it seems to have happened
immediately following an upgrade (though I'm not sure; mysql and postgres
were both upgraded, but ssh was not, so it's unclear why it would suddenly
start crashing the box), and is highly sensitive to the program being run.

Thank you in advance for any information you can give.

-- Nathaniel

-- 
"...these, like all words, have single, decontextualized meanings: everyone
knows what each of these words means, everyone knows what constitutes an
instance of each of their referents.  Language is fixed.  Meaning is
certain.  Santa Claus comes down the chimney at midnight on December 24."
  -- The Language War, Robin Lakoff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Thu Sep 07 2000 - 21:00:15 EST