Re: HARDWARE HELL - POSSIBLE LINUX BUG AND/OR HARDWARE FAILURE ?

From: doctor@fruitbat.org
Date: Thu Feb 17 2000 - 01:37:45 EST


jjs@mail-tele.ccbc.cc.pa.us said ...
>
> First of all, I sincerely apologise for being off topic here and I don't
> want to bother the great work you guys are doing on the linux kernel and
> system but this is a problem I have been dealing with for 6 months or so and
> I am desperate for an answer. Personally I would be happy if my computer
> were to burst into flames, at least then I would know what I have to
> replace. IF YOU ARE TOO BUSY TO HELP WITH A HARDWARE PROBLEM, PLEASE DON'T
> WASTE YOUR TIME AND READ ON :) I KNOW YOU GUYS ARE BUSY

How long have you had this motherboard? How long after you got it did
you first start seeing this problem?

> Oh I forgot to mention I fix PCs for a living and have been using linux
> for almost 4 yrs so I am not a newbie.

If you fix PCs then troubleshooting hardware shouldn't be anything new
to you, right ;-)

> MY SYSTEM
>
> -AMD K6-2 400 BOXED CPU (NEVER OVERCLOCKED)
> -SOYO 5EHM MOTHERBOARD - VIA MVP3 CHIPSET, AT STYLE 1 MEG L2 CACHE
> -2 32 MEG PC100 DIMMS
> -WESTERN DIGITAL 10 GIG HD UDMA33, (ALSO HAD THE
> PROBLEM WITH A MAXTOR 5.7 UDMA33 WAS IN THERE INSTEAD)
> -CREATIVE LABS TNT 16 MEG AGP 2D/3D CARD - NVIDIA TNT CHIPSET
> -AUREAL VORTEX1 CHIPSET PCI SOUNDCARD (ALSO HAD THE PROBLEM WITH A
> ENSONIQ AUDIOPCI ES1370 SOUNDCARD IN THERE INSTEAD)
> -NETGEAR FA310TX 10/100 PCI NETWORK CARD - PNIC TULIP CLONE CHIPSET
> -NO CDROM CURRENTLY INSTALLED DUE TO FAILURE
>
> If my wording looks weird or does not make sense it's because my nerves
> are shot over this. Please feel free to ask me what something means
>
> [Software]
> WINDOWS 98
> Slackware 7.0 BARE.I bootdisk glibc based distro and Slackware 4.0 rescue
> rootdisk libc based distro. GNU SHRED from unstable fileutils series
> statically compiled in a redhat 5.x or 6.0 glibc system.
>
>
> Okay, a few months ago while doing a disk shred for various reasons
> (no not porn or hacking darnit!) using GNU shred, a file/device shredder,
> Linux just suddenly crashed for no apparent reason. From then on,
> win98 began to just freeze after sitting for a while or while doing

This could simply be the most visible symptom of a more systemic problem.

> something intensive like a 3d game or demo. Linux off a bootdisk would
> sometimes fail to uncompress the kernel due to crc errors or refuse to
> load a ramdisk due to invalid size, both of which point to memory
> problems. Windows 98 or Linux would just sometimes reboot by themselves in
> the middle of stuff. Windows98 would sometimes bomb for no apparent
> reason. The system would sometimes even crash before it started loading an
> os and would freeze in the middle of printing a bios message sentence.
> Sometimes reseating the ram, playing with bus speeds, etc would seem to
> make the problem go away but this was always temporary. Well after many
> months of problems I finally got desperate and took the system completely
> apart, cleaned all the pci card contacts and dimm memory contacts with an
> eraser and then scrubbed the pci cards, motherboard, cpu, and memory all
> with a toothbrush and isopropyl alcohol. From then on the system ran fine
> with no problems. That lasted only a few months as we will soon see.

I've seen what you describe before. Here are a few things to check:

Check that none of the mounting screws are stressing the motherboard. If
you had to force a screw into position or shift/bend the motherboard to
get the holes to line up, then you've stressed it. Stressing the board
can cause traces to crack or break. From then on, the temperature of the
board determines whether a cracked trace conducts: it may make contact
while cold and open up as the board warms and expands. I recommend
pulling all the cards and re-positioning the motherboard. Also, make sure
that you don't have to be overly forceful to insert a card. Check the
alignment of the card's connector with the slot on the motherboard. If
you have to shift the card to make it fit, then you might be putting
stress on the connector and/or card.

Check that there is adequate ventilation. Run the system for 24 hours
then immediately open the case and touch the board and all the ICs
(ground yourself first!). Are they hot to the touch? If so, then you
probably don't have enough air flow. A lot of people think that the fan
in the PS is "enough". In many cases, the design of the computer case
doesn't provide very good air flow. Hard disks today tend to run hot,
contributing to the ambient temperature inside the case. Often, just
installing a second fan to circulate the air makes a difference.

> About a month ago while doing a disk shred, the problem appeared again
> and Linux just oop'ed in the middle of running GNU shred, sometimes, it
> would just crash the program, many times the whole kernel would crash with
> "unable to handle null pointer at virtual address, etc" or "stack dump
> process ........" (correct me if I got the messages wrong). I have not
> had windows98 back on the system since the crashes started again so I
> don't know if the problem is back there also but I do know it's not a
> power supply, a pci card or a hard drive bug because all those parts have
> been swapped out before when the problem arose and the problem was still
> there although sometimes it would go away for a while. Just a few days
> ago now, Linux has also begun to blow major chunks just doing a dd
> if=/dev/zero of=/dev/hda bs=1024k on my 10
> gig hd with the process crashing or the whole kernel dying or rebooting.
> I managed one time to get a weird error about LRU TABLE CORRUPT or
> something
>
> After many runs to try to duplicate the problem I finally managed to
> get enough consistent crashes to discover that when the system starts to
> get into one of its "moods" and crashes almost every time you boot linux
> and run that problem or even just turn on the system and have it crash
> into the middle of bios messages, ALL THE PROBLEMS STOP IF YOU DISABLE THE
> EXTERNAL L2 CACHE. I figured I finally found the source of the problem, a
> bad l2 cache but to be sure I played with the ram, first removing 1 of the
> dimms and it still crashed, then moving the other dimm to that slot and it
> was stable. Now the system is behaving with both dimms in but it has
> pretended to be fixed before and started misbehaving very soon. I also
> noticed the problem seems to go away for a while if the system is left off
> for a while but nothing is overheating from what I can tell.

These are classic symptoms of overheating, and they can also indicate
that a cracked trace is opening up once it has heated sufficiently.
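
If you want something more repeatable than waiting for shred or dd to
fall over, here is a rough sketch of a user-space pattern test. It is my
own illustration, not a tool from this thread, and every name in it
(PATTERNS, find_mismatches, pattern_pass, soak) is made up for the
example. It assumes a Python interpreter is available on the box. It
cannot choose which physical RAM it touches and it is no substitute for
a proper boot-time memory tester, but logging the elapsed time next to
each pass lets you see whether errors only show up once the machine has
warmed up.

```python
import time

# Assumed test patterns: all-zeros, all-ones, and two alternating-bit bytes.
PATTERNS = (0x00, 0xFF, 0xAA, 0x55)


def find_mismatches(buf, pattern):
    """Return the offsets in buf whose byte differs from pattern."""
    expected = bytes([pattern]) * len(buf)
    if buf == expected:  # fast path: the whole buffer read back intact
        return []
    return [i for i, b in enumerate(buf) if b != pattern]


def pattern_pass(size_bytes, pattern):
    """Fill a fresh buffer with pattern, read it back, return bad offsets."""
    buf = bytearray([pattern]) * size_bytes
    return find_mismatches(buf, pattern)


def soak(size_bytes=8 * 1024 * 1024, passes=4):
    """Run repeated passes, cycling through PATTERNS.

    Returns a list of (seconds_since_start, pattern, mismatch_count).
    """
    start = time.time()
    results = []
    for n in range(passes):
        pattern = PATTERNS[n % len(PATTERNS)]
        bad = pattern_pass(size_bytes, pattern)
        results.append((time.time() - start, pattern, len(bad)))
    return results


if __name__ == "__main__":
    for elapsed, pattern, n_bad in soak():
        print("t=%6.1fs pattern=0x%02X mismatches=%d" % (elapsed, pattern, n_bad))
```

Run it once from a cold start and again after the box has been up for a
few hours, with the L2 cache enabled and then disabled. A clean result
proves little, but mismatches that appear only when warm or only with
the cache on would point the same way as your other observations.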

> OKAY!!!! lemme finally end this horribly long message but I am desperate
> for advice. I really suspect the motherboard is shot since I had the ram
> tested and it was okay. I also forgot to mention that my k6-2 400 is a
> boxed cpu which means while you get a 3yr warranty, you also get a glued
> on small heatsink and a cheap ball bearing fan that even with ball
> bearings only lasted me 1.5 months. Since I did not want to wait for a
> warranty replacement and I did not want to have to use a fan of the same
> crap size and design, I attempted to pry off the glued on heatsink so I
> could use a better heatsink fan combo, and I managed to chip ceramic off
> the corner of the cpu. There were no exposed wires or silicon and after

Ouch! Did you experience your crashes before or after your operation on
the CPU? Breaking the ceramic casing means you might have damaged
something else. Have you tried another CPU on the board? Have you tried
the CPU in another known good system?

> I replaced the fan the system worked fine for months after that. I am
> very tempted to just replace the motherboard and see if this stops the
> problems I am having but I wanted to make sure that it's not another
> problem that is giving the same symptoms of a bad motherboard. Thanks for
> any help and God Bless.
>
> P.S. if this letter gets me booted off the mailing list or is ignored, I
> understand completely but this was my last ditch cry for help before I
> give up and jump off a bridge or save up for an athlon.

Let's not jump to conclusions (pun ;-). Heat-related problems are
solvable with a little extra ventilation. Cracked traces, however, are
harder to solve. Answers to the above will reveal more.

> Jason Salopek

-- 
Peter A. Castro (doctor@fruitbat.org) or (pcastro@us.oracle.com)



