Re: Dell XPS13: MCE (Hardware Error) reported
From: Borislav Petkov
Date: Wed Jan 04 2017 - 18:07:12 EST
Lemme add some more folks to CC.
On Wed, Jan 04, 2017 at 04:42:18PM +0100, Paul Menzel wrote:
> Dear Linux folks,
>
>
> The logs contain the following messages.
>
> From Linux 4.10-rc2+ (0f64df301240 Merge branch 'parisc-4.10-2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux):
>
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ff40 MISC 47880018086
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: ee0000000040110a
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ce40 MISC 7880018086
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
>
> I am able to reproduce this also with Linux 4.8.11 from Debian Sid/unstable.
>
> Installing *mcelog* 144+dfsg-1, the file below is created.
>
> ```
> $ more /var/log/mcelog
> Hardware event. This is not a software error.
> MCE 0
> CPU 0 BANK 6
> MISC 47880018086 ADDR fef1ff40
> TIME 1483543069 Wed Jan 4 16:17:49 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 1
> CPU 0 BANK 7
> MISC 7880018086 ADDR fef1ce40
> TIME 1483543069 Wed Jan 4 16:17:49 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 0
> CPU 0 BANK 6
> MISC 47880018086 ADDR fef1ff40
> TIME 1483543581 Wed Jan 4 16:26:21 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 1
> CPU 0 BANK 7
> MISC 7880018086 ADDR fef1ce40
> TIME 1483543581 Wed Jan 4 16:26:21 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> ```
>
> It looks like it's a common problem on this machine [1].
>
> > First, I fear that I cannot really give good answers to your questions. I also own a Dell XPS 13 (9360) and see the same MCE messages. I'm in contact with Dell Support because of these. They replaced the mainboard but it did not help. Same messages in the logs. At some point they concluded that it is probably a false positive. They had no idea what is causing it, though (mcelog/kernel/Intel problem?). The correspondence with Support is still ongoing.
> >
> > <rant> Btw, talking to Dell Support is a very unpleasant experience. They seem to only suggest the "standard" solutions like resetting the Firmware, run self-health tests and so on. I didn't had the impression to talk to someone with some technical insight. </rant>
> >
> > To add more details, I see the same issue on Fedora 24 so it seems not to be related to Ubuntu.
> >
> > Regarding your questions:
> >
> > What do these errors mean and should I worry about them?
> >
> > I don't know. Dell Support thinks those are false positives.
> >
> > Could these hardware errors be the cause of the freezes of the entire system?
> >
> > Besides the messages my system works fine. I'd guess the freeze is a different issue.
> >
> > Should I have the laptop (or parts) replaced by the manufacturer?
> >
> > Replacing the mainboard did not fix the MCE issue. It might solve the freezing issue, although it seems that this was fixed by a kernel update.
> >
> > Are there any other actions I should take?
> >
> > If you are not already in contact with Support, contact them. Maybe they will come up with a real solution once they see that it affects more customers.
>
> Could you please tell me, if and where I should open an issue in the Linux
> bug tracker [2]?
>
> Any ideas are welcome.
>
>
> Kind regards,
>
> Paul
>
>
> [1] https://unix.stackexchange.com/questions/324237/understanding-machine-check-exceptions-mce/330283
> [2] https://bugzilla.kernel.org/
>
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
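
For reference, the STATUS value from the quoted logs can be decoded by hand. The following is only a rough sketch, assuming the architectural IA32_MCi_STATUS bit layout and the compound "cache hierarchy" error-code format from the Intel SDM. The name decode_status and the dictionary keys are made up for illustration, and the model-specific bits are extracted but not interpreted.

```python
# Sketch: decode the architectural fields of an MCi_STATUS value.
# Assumes the Intel SDM layout; decode_status and its keys are hypothetical names.

STATUS = 0xEE0000000040110A  # value reported for banks 6 and 7 in the logs

# Architectural flag bits in the upper part of MCi_STATUS.
FLAGS = {
    63: "VAL",    # register contents valid
    62: "OVER",   # error overflow
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

CACHE_LEVELS = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}
TRANSACTION_TYPES = {0: "instruction", 1: "data", 2: "generic"}


def decode_status(status):
    """Return the set flags and the decoded MCA error code fields."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    result = {
        "flags": flags,
        "mca_code": mca_code,
        "model_specific_code": (status >> 16) & 0xFFFF,
        # Bit 12 of the MCA error code is the "corrected filtering" bit.
        "corrected_filtering": bool(mca_code & 0x1000),
    }
    # Cache hierarchy errors have the form 000F 0001 RRRR TTLL.
    if (mca_code & 0xEF00) == 0x0100:
        result["cache_level"] = CACHE_LEVELS[mca_code & 0x3]
        result["transaction_type"] = TRANSACTION_TYPES.get(
            (mca_code >> 2) & 0x3, "reserved")
        result["request"] = (mca_code >> 4) & 0xF  # 0 means generic error
    return result


if __name__ == "__main__":
    for key, value in decode_status(STATUS).items():
        print(key, value)
```

Under those assumptions the set flags correspond to the lines mcelog prints (overflow, uncorrected, MISC/ADDR valid, processor context corrupt), and EN is not set in this value.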