Re: Random reboots

From: Ryan Richter
Date: Tue Feb 14 2006 - 08:27:06 EST


On Tue, Feb 14, 2006 at 09:54:47AM +0100, Erik Mouw wrote:
> On Mon, Feb 13, 2006 at 04:49:57PM -0500, Ryan Richter wrote:
> > On Mon, Feb 13, 2006 at 04:39:29PM -0500, ryan wrote:
> > > It runs Debian Sarge for AMD64. I have lots of other machines, but only
> > > this one gets the reboots. None of the others have SCSI, and none are
> > > dual-CPU with memory on both nodes, just to name two obvious things
> > > different on this machine.
> >
> > Thinking about this some more... My home desktop also is a dual opteron
> > with memory on both nodes and SCSI, but it hasn't had any reboots. The
> > machine with the reboot trouble uses RAID5+LVM, unlike my desktop. Also
> > it's an NFS server, but I have another machine (single-cpu pentium 4, no
> > SCSI etc.) that's an NFS server without reboots. But none of the other
> > machines have RAID or LVM.
>
> We recently had such an issue with a dual AMD64 machine rebooting at
> mke2fs. It turned out it was a faulty power supply. After we changed
> the power supply, everything ran smooth again.
>
> You could start to test by powering your drives from an old AT-style
> power supply leaving more "juice" for the main board and CPUs.

It's possible, but I doubt it. More often than not, the reboot happens
when the machine is completely idle; in fact, I can't remember a single
time when it wasn't idle. I just spent a couple of months debugging a
SCSI tape crash, during which I ran the backups a lot and had lots of
RAID resyncs, and it *never* rebooted during either of these. Anyway,
it has quite a large 2+1 redundant power supply and, like I said, we
routinely had 3+ months of uptime with older kernels.

Over the years I've had this machine, I've hit at least 10-15 strange
kernel bugs that only happened on this machine. Each and every time I
was *convinced* that the hardware was at fault (and people on the
mailing list suggested as much) until either a kernel came out that
fixed the problem or a kernel developer positively identified it as a
kernel problem and eventually fixed it. This machine just seems to be a
magnet for kernel bugs.

Thanks,
-ryan