Re: SMP + Hyperthreading / Asus PCDL Deluxe / Kernel 2.4.x 2.6.x /Crash/Freeze

From: Len Brown
Date: Fri Mar 12 2004 - 02:08:35 EST


Hmm, read that note too fast...
Since the failure did not follow the package to the BSP socket
(CPU0/CPU1), but instead stayed with the AP (CPU2/CPU3) socket, that
suggests an issue with the MB rather than the processor itself.

-Len

On Fri, 2004-03-12 at 01:27, Len Brown wrote:
> On Thu, 2004-03-11 at 19:42, Richard Browning wrote:
> > On Friday 12 March 2004 00:36, Zwane Mwaikambo wrote:
> > > On Fri, 12 Mar 2004, Richard Browning wrote:
> > > > > For my own curiosity, does switching the processors around do anything?
> > > > > Those MCEs look confined to the non bootstrap processor package.
> > > >
> > > > Switched CPUs. This time I get the following:
> > > >
> > > > CPU3: Machine Check Exception: 000.0004
> > > > CPU2: Machine Check Exception: 000.0004
> > > > Bank 0: a20000008c010400
> > > > Kernel Panic: CPU context corrupt
> > > > In idle task - not syncing
> > > >
> > > > Note that the CPU# designations are swapped and that there's only one
> > > > Bank 0: message. Is this significant?
> > >
> > > Ok, but that's still on the same package so it's not moving with the
> > > processor, thanks. Could you also supply processor info from
> > > /proc/cpuinfo.
> >
> > I suppose that's good (for me); indicates no hardware error?
>
> MCE == hardware error.
> In this case un-recoverable.
>
> I'll take a swing at decoding this, call the Coast Guard if I don't
> return in 30 minutes;-)
>
> http://developer.intel.com/design/pentium4/manuals/25366813.pdf
>
> > Machine Check Exception: 000.0004
>
> fig 14-4 says this means that indeed, you have a valid MCE.
>
> > Bank 0: a20000008c010400
>
> fig 14-6 says:
> 63: valid register contents
> 61: UC -- processor did not correct the error
> 57: PCC -- Processor context corrupt (you're dead)
>
> 0400 is the MCA error code
>
> fig E2 says
> 10 - internal watchdog timeout.
> 26,27 -- TT -- Thread timeout indicator -- both threads timed out
>
> > /proc/cpuinfo of course:
> >
> > processor : 0
> > vendor_id : GenuineIntel
> > cpu family : 15
> > model : 2
>
> I have no idea what causes this error, but it sure sounds specific to
> the processor, and specific to HT -- which matches your experiments.
> I'd imagine that once you've verified that you have the latest BIOS for
> the board, and if the error persists, you should look into getting that
> specific processor replaced.
>
> cheers,
> -Len
>
>
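
For anyone who wants to repeat the bit-picking on the quoted "Bank 0" value, here is a minimal sketch in Python. It is not kernel code. The flag names for bits 63, 61 and 57 are the ones read off fig 14-6 in the message above; the split into a low 16-bit MCA error code and a 16-bit model-specific field in bits 31:16 is the layout assumed here, and the function name `decode_mci_status` is made up for illustration.

```python
# Hypothetical helper: decode a machine-check bank status value.
# Flag meanings are the three named in the thread (fig 14-6 of the cited manual).
STATUS_FLAGS = {
    63: "VAL: register contents valid",
    61: "UC: processor did not correct the error",
    57: "PCC: processor context corrupt",
}


def decode_mci_status(status):
    """Return the flags, error-code fields and raw set bits of a status word."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        # Assumed layout: bits 15:0 hold the MCA error code.
        "mca_error_code": status & 0xFFFF,
        # Assumed layout: bits 31:16 hold the model-specific error code.
        "model_specific_code": (status >> 16) & 0xFFFF,
        # Every set bit, highest first, for checking against the manual by hand.
        "set_bits": [b for b in range(63, -1, -1) if (status >> b) & 1],
    }


# The value reported in the thread: "Bank 0: a20000008c010400"
decoded = decode_mci_status(0xA20000008C010400)

for flag in decoded["flags"]:
    print(flag)
print(f"MCA error code:      {decoded['mca_error_code']:#06x}")
print(f"model-specific code: {decoded['model_specific_code']:#06x}")
print("set bits:", decoded["set_bits"])
```

The list of set bits includes 10, 26 and 27, which are the bits looked up in fig E2 above (watchdog timeout and the two thread-timeout indicators).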
