Re: MSI K8D-Master - GART error 3

From: Andi Kleen (ak@colin2.muc.de)
Date: Tue Aug 05 2003 - 08:42:41 EST


On Tue, Aug 05, 2003 at 12:45:01PM +1200, Simon Garner wrote:
> Andi Kleen <ak@muc.de> wrote:
>
> > There is nothing in any of my trees that generates such a message.
> > If it was GART related it would be either "GART TLB error ..." or
> > "extended error gart error". But even that should not happen anymore,
> > see below.
> >
> > I don't know what the RedHat kernel does, they may have changed the
> > MCE handler over the reference port.
> >
>
> A quick google brings up this reference:
> http://www.iglu.org.il/lxr/source/arch/x86_64/kernel/bluesmoke.c

OK, that's the very old MCE code that incorrectly enabled the northbridge
machine check. Either don't use that kernel, or boot it with mce=off.
However, I still think it's a driver bug in your case. If it were the
shaky GART MCE itself, you would get a panic, because it's an
unrecoverable MCE. More likely the driver is accessing PCI DMA mappings
after they were unmapped, which is a serious bug, but somehow not serious
enough for the northbridge to trigger the MCE.

I was confused by your statement that the SuSE 8.2 beta9 kernel
generated that. It didn't because it doesn't contain that old code.

What exactly does a modern kernel, like the SuSE one or an x86-64.org
kernel, generate?

>
> The error appears to be generated by the code starting around line 152
> in that file.
>
> Btw, what is 'bluesmoke'?

Alan Cox's sense of humour. Look it up in the jargon file.

> > You can always disable it with mce=off or better mce=0
> > as the message seems to be caused by the periodic non fatal MCE check
> > timer.
> >
>
> What will I lose by disabling this?

mce=0 turns off periodic MCE checking for non-fatal errors.
That's not a big issue; the worst you lose is reporting of single-bit
corrected ECC memory errors.

mce=off turns off MCE reporting for fatal MCE exceptions (however,
your box may still crash when something really bad happens).

mce=0 should have turned off the periodic check, and your
message very much looks like a periodic one, as actual MCE
exceptions report more data. I'm a bit puzzled why it doesn't
kill the message here. You can try mce=off, but I'm not
sure it will help either.
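
To double-check which mce= setting a box was actually booted with, a
minimal sketch along these lines may help. It is only an illustration:
the function names and the EFFECTS table are made up for this example,
the descriptions just restate the two settings explained above, and
"last mce= option wins" is an assumption, not something taken from the
kernel source.

```python
# Hypothetical helper: summarises the two mce= settings described above.
# Not taken from any kernel tree.

EFFECTS = {
    "0": "periodic check for non-fatal MCEs is turned off",
    "off": "reporting of fatal MCE exceptions is turned off",
}


def mce_option(cmdline):
    """Return the value of the mce= option on a kernel command line.

    Returns None if no mce= option is present. If the option appears
    more than once, the last one is returned (an assumption).
    """
    value = None
    for token in cmdline.split():
        if token.startswith("mce="):
            value = token[len("mce="):]
    return value


def describe(cmdline):
    """Describe the effect of the mce= option found on cmdline."""
    value = mce_option(cmdline)
    if value is None:
        return "no mce= option given"
    return EFFECTS.get(value, "mce=%s is not covered here" % value)


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()
```

On a running Linux system, print(describe(read_cmdline())) shows which
of the two cases applies to the current boot.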

Using a newer kernel is probably a good idea anyway, as there
have been many bug fixes since then.

-Andi

