2.2.16 / 486 / Machine Lockup

From: madsen@vijit.com
Date: Mon Jul 31 2000 - 22:47:51 EST


First, I'm sorry if I'm posting to the wrong forum. I do think this is
a kernel issue.

Problem: A machine running 2.2.16 completely locks up
without warning or error message when a key is pressed *after having been up
for a while*. "A while" seems to be more than a couple of hours, but I
don't know an exact time (maybe overnight). Before "a while", the keyboard
works just fine.

Details: This is a machine that normally sits in a corner; normal access
is via network rather than keyboard, so after it's booted, the keyboard
isn't touched. After "a while", if you touch the keyboard, the screen
is unblanked from the screen saver but then then entire machine locks up
with no message. At that point, NO key works and it doesn't respond to
the network; the only recovery is the reset button. (Great fsck test! :-))
If the key pressed happens to be "return", a new "login:" prompt is
displayed before the machine locks. (Assuming it was at "login:" already).

I do not know if pressing a key before "a while" passes prolongs "a while".

I may be wrong, but I do not believe this is a hardware problem, and I
*hope* that this isn't unique to this machine... It's not heat; everything
else works just fine. It has genuine :-) parity memory, so failing memory
should be detected as such. (BTW, how does Linux handle parity/ECC problems,
anyway?)

I think that this problem (or a similar one) had been discussed on this
list earlier (something about stuck or uncleared keyboard interrupts), but
that it was difficult to reproduce and that everyone thought it was fixed
(this was in an earlier "dot" release). My recollection may be way faulty,
though...

I know that this problem has been there since at least .12, probably
earlier, but I don't remember. (Heh, I think it was ok in 2.0 :-) :-)).

Machine: EISA 486 DX 50 (NOT DX/2!), 20 MB memory, AMI MB, Adaptec 274x,
SCSI CDROM (Sony, I think), Seagate SCSI 2 GB disk. Logitech bus mouse.
Serial/parallel ports in machine but unused. 3C579 (UTP).
No interrupts shared (at least on purpose! :-)).
Used to have 300 MB ESDI disk, but the chassis couldn't cool both FH drives
simultaneously. :-) [Maxtor ESDI drive still worked when I took it
off-line; it was over 10 years!]. The Adaptec ESDI controller is still in
the machine functioning as diskette controller because changing EISA config
to make 274x the diskette controller is painful. (Machine doesn't have
MSDOS on-disk...)

Current workaround is sysadmin via network; if local access is required,
shutdown from remote access, then walk over to machine. I could live with
this problem except for two reasons: (1) Every once in a while, someone
hits the keyboard at the wrong time; (2) If it were my software, I'd be
angry that I missed a bug. It would bother me. And I'd want to be told
about it if I didn't already know.

Unfortunately, I don't know enough about the kernel to dig around myself
to be able to better characterize the problem, and right now, "the
kernel" is a pretty big elephant to try to get my arms around. This is
off-topic, but it sure would be nice to have an UP TO DATE guide or
*something* that would enable an experienced programmer who otherwise
just doesn't know the kernel to "get into" it. Yeah, I know, it depends
on what part of the kernel, but I haven't been able to find any decent
higher level view of the thing. Those who have worked on linux for a
long time will have accumulated this knowledge (and will keep it current),
and would know where to start looking for this problem, and how to track
it down. Having worked on other OSes, I know how I would proceed *on
those*, but on Linux, I'm clueless. ... And yes, I know, the ideal personal
solution would be for me to "dig in", learn it myself, and produce a
guide for others along the way. Alas, I just don't have the time.
Anybody with decent RTFM references is welcome to let me know about them!
(If sent private e-mail, I will summarize and post unless requested not to).

Dave Madsen ---dcm
madsen@vijit.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Mon Aug 07 2000 - 21:00:06 EST