Question about Kernel MCE and what exactly they could mean

From: Roger Heflin
Date: Thu Jun 30 2005 - 09:36:58 EST


Hello,

I have customer that is getting MCE errors, the errors are
non consistantly on a single cpu (this is a opteron system
with separate memory busses), and the customer believes
that these correspond with the same time he sshes into the
machine from external, they also appear to only be a serious
issue just after boot up, if the machine survives the first
few sshes, we don't seem to have an issue later, so things
seem to be a early boot up issue of some sort.

The kernel is a Suse 2.4 kernel variant (2.4.21-143),
I don't expect anyone to likely know if it is a bug.

I have debugged and fixed a large number of machines getting
kernel panic mce errors (by replacing cpu, and if it still
occurs replacing memory), but I have never seen one being
started by something like inbound ssh, and have never seen
it move from cpu to cpu with what looks like the same "cause".

The pro's of it being hardware are:
It gets mce errors.

The pro's of it being software are:
They aren't consistant on cpu's
The machine survives heavy HPL runs and does not appear
to have broken hardware.
A specific user action seems to cause the issue.

This issue does seem to occur several times a week.

Anyone have any ideas or experience of whether this is a
kernel bug or hardware problem?

Roger

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/