On Fri, 6 Dec 2002, Greg Boyce wrote:
> On 6 Dec 2002, Alan Cox wrote:
> > Take a sample set of machines which have been crashing and run memtest86
> > on a couple. That should tell you if it is RAM. From a sample you can
> > then figure out how to handle the rest (things that come to mind if
> > memtest86 fails on the test machines include replacing the ram in a few
> > more then taking the old ram back to test)
>
> I'll mention it to the people who handle the replacement of hardware, but
> from the sounds of this and Dick's e-mail, it's most likely hardware of
> some sort or possibly overheating. They can decide if they want to try to
> figure out which component is causing the problem, or if they'd prefer to
> just replace the faulty machines completely and worry about tracking the
> component later. We have plenty of spares in the warehouse.

Actually, this does leave one question still: How serious is the problem?
How much would you trust a machine reporting these errors? Most of the
machines are just performing DNS and web service (although with a pretty
high load). The processes on the machine are CPU and memory
intensive, but there is no critical data stored on most of the machines.
Are the machines likely to give us problems with crashing and data
corruption, or would it be safe to ignore the problem unless we started
noticing odd behavior?

Greg
This archive was generated by hypermail 2b29 : Sat Dec 07 2002 - 22:00:27 EST