Re: Dell XPS13: MCE (Hardware Error) reported

From: Paul Menzel
Date: Mon Jan 09 2017 - 06:54:02 EST


Dear Ashok, dear Borislav,


On 01/05/17 02:12, Raj, Ashok wrote:

CPUID Vendor Intel Family 6 Model 142
This is Kabylake Mobile

Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 7880018086 ADDR fef1ce40
TIME 1483543069 Wed Jan 4 16:17:49 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0

Decoding the bits further from MCi_STATUS above:
VAL=1, OVER=1, UC=1, but EN=0 indicates this isn't an MCE, hence it should have
been signaled by a CMCI.

PCC=1, but should be ignored when EN=0.
MCACOD: 110a MSCOD: 0040
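
For anyone who wants to reproduce the bit decoding above by hand, here is a minimal sketch. The bit positions are the IA32_MCi_STATUS layout and the cache hierarchy compound error code format (000F 0001 RRRR TTLL) as I understand them from the Intel SDM; the function names are my own and are not part of mcelog or the kernel.

```python
# Sketch only: field positions assumed from the Intel SDM description of
# IA32_MCi_STATUS; the helper names are hypothetical.
STATUS_FLAGS = {
    "VAL": 63,    # register contents valid
    "OVER": 62,   # error overflow
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled for this bank
    "MISCV": 59,  # MCi_MISC register valid
    "ADDRV": 58,  # MCi_ADDR register valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a raw MCi_STATUS value into its flag bits and error codes."""
    fields = {name: (status >> bit) & 1 for name, bit in STATUS_FLAGS.items()}
    fields["MSCOD"] = (status >> 16) & 0xFFFF  # model-specific error code
    fields["MCACOD"] = status & 0xFFFF         # architectural error code
    return fields


def decode_cache_mcacod(mcacod: int):
    """Decode a cache hierarchy error code, or return None for other types."""
    if (mcacod & 0xEF00) != 0x0100:
        return None
    return {
        "F": (mcacod >> 12) & 1,     # corrected-error filtering indication
        "RRRR": (mcacod >> 4) & 0xF,  # request type
        "TT": (mcacod >> 2) & 0x3,    # transaction type
        "LL": mcacod & 0x3,           # cache level (0b10 is level 2)
    }


if __name__ == "__main__":
    decoded = decode_mci_status(0xEE0000000040110A)
    for name, value in decoded.items():
        print(f"{name:6} = {value:#x}")
    print(decode_cache_mcacod(decoded["MCACOD"]))
```

The value passed in is the STATUS line from the mcelog output quoted above.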

If the system is stable enough after the report, can you send the output of
/proc/interrupts to confirm that?

To be clear, other than the message, the system is stable for me.

Here is `/proc/interrupts`.

```
$ more /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 27 0 0 0 IR-IO-APIC 2-edge timer
1: 3 2 125 5 IR-IO-APIC 1-edge i8042
8: 0 1 0 0 IR-IO-APIC 8-edge rtc0
9: 108 31 397 5 IR-IO-APIC 9-fasteoi acpi
12: 66 18 92 35 IR-IO-APIC 12-edge i8042
14: 0 0 0 0 IR-IO-APIC 14-fasteoi INT344B:00
16: 0 0 0 0 IR-IO-APIC 16-fasteoi idma64.0, i801_smbus, i2c_designware.0
17: 419 42 280 415 IR-IO-APIC 17-fasteoi idma64.1, i2c_designware.1
51: 2 0 0 1 IR-IO-APIC 51-fasteoi DLL075B:01
120: 0 0 0 0 DMAR-MSI 0-edge dmar0
121: 0 0 0 0 DMAR-MSI 1-edge dmar1
274: 17 2 0 4 IR-PCI-MSI 30932992-edge rtsx_pci
275: 89 26 57 45 IR-PCI-MSI 327680-edge xhci_hcd
276: 1886 0 2361 0 IR-PCI-MSI 31457280-edge nvme0q0, nvme0q1
277: 0 3010 2570 0 IR-PCI-MSI 31457281-edge nvme0q2
278: 0 0 2023 3480 IR-PCI-MSI 31457282-edge nvme0q3
279: 0 3319 0 5863 IR-PCI-MSI 31457283-edge nvme0q4
280: 45 0 0 0 IR-PCI-MSI 360448-edge mei_me
281: 201 52 3008 85 IR-PCI-MSI 32768-edge i915
282: 151 29 997 24821 IR-PCI-MSI 30408704-edge ath10k_pci
283: 331 938 677 188 IR-PCI-MSI 514048-edge snd_hda_intel:card0
NMI: 1 0 0 0 Non-maskable interrupts
LOC: 15198 21708 16850 31954 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
PMI: 1 0 0 0 Performance monitoring interrupts
IWI: 3 0 0 0 IRQ work interrupts
RTR: 0 0 0 0 APIC ICR read retries
RES: 1329 1974 1532 1959 Rescheduling interrupts
CAL: 2254 3827 1969 3963 Function call interrupts
TLB: 396 2349 342 2193 TLB shootdowns
TRM: 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 9 9 9 9 Machine check polls
ERR: 17
MIS: 0
PIN: 0 0 0 0 Posted-interrupt notification event
PIW: 0 0 0 0 Posted-interrupt wakeup event
```
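
To pull out only the rows relevant to this question, a small sketch like the following could be used. It assumes the usual `/proc/interrupts` layout of `LABEL: count count ... description`; the function name and the choice of labels are mine. As far as I know, Linux accounts CMCI under the THR (threshold) row on Intel, but please treat that as an assumption.

```python
# Sketch only: extracts per-CPU counts for selected rows of /proc/interrupts.
# The function name and default labels are hypothetical choices.
def machine_check_counters(text, labels=("MCE", "MCP", "THR")):
    counters = {}
    for line in text.splitlines():
        label, sep, rest = line.strip().partition(":")
        if not sep or label not in labels:
            continue
        counts = []
        for token in rest.split():
            if not token.isdigit():
                break  # the description starts here
            counts.append(int(token))
        counters[label] = counts
    return counters


# Rows copied from the listing above.
SAMPLE = """\
THR: 0 0 0 0 Threshold APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 9 9 9 9 Machine check polls
"""

if __name__ == "__main__":
    for label, counts in machine_check_counters(SAMPLE).items():
        print(label, counts)
```

On a live system, the text could be read with `open("/proc/interrupts").read()` instead of using the sample rows.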

Although it's reported as an L2 error, some memory errors can also manifest
themselves as cache errors in certain cases. In this case it looks like
some speculative fetch from bad memory might be the cause.

MCGCAP c08 APICID 0 SOCKETID 0

MCG_CAP: c08
Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
Threshold based error reporting (bit 11) (TES_P).
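
The MCG_CAP value can be checked the same way. This is a minimal sketch assuming the IA32_MCG_CAP layout from the Intel SDM (bank count in bits 7:0, CTL_P bit 8, EXT_P bit 9, CMCI_P bit 10, TES_P bit 11); the function name is hypothetical.

```python
# Sketch only: bit positions assumed from the Intel SDM description of
# IA32_MCG_CAP; decode_mcg_cap is a hypothetical helper name.
def decode_mcg_cap(cap: int) -> dict:
    return {
        "bank_count": cap & 0xFF,   # number of reporting banks
        "CTL_P": (cap >> 8) & 1,    # IA32_MCG_CTL present
        "EXT_P": (cap >> 9) & 1,    # extended state registers present
        "CMCI_P": (cap >> 10) & 1,  # corrected machine check interrupt
        "TES_P": (cap >> 11) & 1,   # threshold-based error status
    }


if __name__ == "__main__":
    for name, value in decode_mcg_cap(0xC08).items():
        print(f"{name:10} = {value}")
```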


Do you have another machine which doesn't report these errors? If so, try
swapping memory between them to see if the error disappears.

No, I don’t. And everybody I talked to with a Dell XPS13 (9360) seems to have these errors.

I don't have the model specific error handy.. will check that in the meantime
to get some decoding as well.

If you haven't already, running some memory tests would also help.

I need some time for that.

If you replaced the motherboard, did that involve both CPU and memory?
Or just the motherboard swap?

Sorry, I don’t know, as I am not the person from StackExchange [1].


Kind regards,

Paul


[1] https://unix.stackexchange.com/questions/324237/understanding-machine-check-exceptions-mce/330283