CMCI storm on 16C Intel Denverton on Supermicro A2SDi-16C-TP8F running 5.4.0-77

From: pdan@xxxxxxxxx
Date: Wed Jun 23 2021 - 23:20:53 EST


Apologies if this has been brought up before.

I recently purchased a Supermicro A2SDi-16C-TP8F with an Intel Atom
C3958 (Denverton, 16C) processor, and I am seeing CMCI storm log
messages. Hardware details are below. I can reproduce the messages reliably:

- On every boot, after the CPU initialization.
- When saturating the 10GBASE-T network card (Intel X553, integrated
into the Denverton SoC) with iperf3.

My immediate suspicion was a memory error; however, this seems
unlikely:

- Important: Supermicro support confirms that they can reproduce it on
their own test A2SDi-16C-TP8F system when running the same OS. This
presumably rules out my specific system, DIMMs, BIOS configuration, etc.
- Memtest86 does not find any errors, nor can I trigger the storm by
stress testing after boot (memtester, stress-ng, etc.).
- I can still reproduce it after removing or swapping DIMMs and slots.

As expected, booting with mce=no_cmci suppresses the messages, although
the implications of disabling CMCI are unclear to me.
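
For anyone comparing boots with and without the option, here is a minimal
sketch in Python that checks whether a kernel command line string contains
mce=no_cmci. The helper name has_no_cmci is made up for this illustration,
and the check only confirms the parameter was passed; it says nothing about
the side effects of disabling CMCI.

```python
def has_no_cmci(cmdline: str) -> bool:
    """Return True if the kernel command line contains mce=no_cmci.

    has_no_cmci is a hypothetical helper name, not part of any existing tool.
    """
    for token in cmdline.split():
        # Split "key=value" parameters; bare flags have an empty separator.
        key, sep, value = token.partition("=")
        if key == "mce" and sep and value == "no_cmci":
            return True
    return False
```

On a running system the input would typically be the contents of
/proc/cmdline, for example has_no_cmci(open("/proc/cmdline").read()).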

Supermicro support advises that the distribution / kernel I am running
is not supported, because this issue affects newer kernel versions on
16C Denverton, and recommends that I downgrade to a supported version
(https://www.supermicro.com/support/resources/OS/Denverton.cfm), which
is not particularly useful.

Is there any known context or solution for this? I am happy to run any
tests on the system to debug further.

Thank you!
Dan

# dmesg -T|grep -B1 -A1 CMCI
[Wed Jun 23 17:54:18 2021] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
[Wed Jun 23 17:54:18 2021] mce: CMCI storm detected: switching to poll mode
[Wed Jun 23 17:54:18 2021] smp: Brought up 1 node, 16 CPUs
--
[Wed Jun 23 17:57:43 2021] igb 0000:04:00.1 lan1: igb: lan1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[Wed Jun 23 17:59:19 2021] mce: CMCI storm subsided: switching to interrupt mode
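
To quantify how long the kernel stays in poll mode, the two storm lines
can be paired up programmatically. The following is a rough sketch in
Python, assuming the human-readable timestamps produced by dmesg -T and
the exact message wording shown above; the names STORM_RE and
storm_durations are invented for this example.

```python
import re
from datetime import datetime

# Matches lines such as:
# [Wed Jun 23 17:54:18 2021] mce: CMCI storm detected: switching to poll mode
STORM_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+mce: CMCI storm (?P<event>detected|subsided)"
)


def storm_durations(lines):
    """Return the length in seconds of each detected -> subsided interval."""
    durations = []
    started = None
    for line in lines:
        m = STORM_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
        if m.group("event") == "detected":
            # Keep the earliest start if "detected" repeats before "subsided".
            if started is None:
                started = ts
        elif started is not None:
            durations.append((ts - started).total_seconds())
            started = None
    return durations


# The storm lines quoted in this report, used as sample input.
SAMPLE = [
    "[Wed Jun 23 17:54:18 2021] mce: CMCI storm detected: switching to poll mode",
    "[Wed Jun 23 17:54:18 2021] smp: Brought up 1 node, 16 CPUs",
    "[Wed Jun 23 17:59:19 2021] mce: CMCI storm subsided: switching to interrupt mode",
]
```

For the sample lines, storm_durations(SAMPLE) gives one interval of 301
seconds (17:54:18 to 17:59:19). dmesg -T timestamps are approximate, so
the result should be read as an estimate.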

# uname -a
Linux atlas 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal

# dmidecode | grep -A 3 "Base Board Information"
Base Board Information
Manufacturer: Supermicro
Product Name: A2SDi-16C-TP8F
Version: 2.00

# dmidecode | grep -A 3 "BIOS Information"
BIOS Information
Vendor: American Megatrends Inc.
Version: 1.5
Release Date: 05/17/2021

# lscpu | grep "Model name"
Model name: Intel(R) Atom(TM) CPU C3958 @ 2.00GHz

# lspci
00:00.0 Host bridge: Intel Corporation Atom Processor C3000 Series System Agent (rev 11)
00:04.0 Host bridge: Intel Corporation Atom Processor C3000 Series Error Registers (rev 11)
00:05.0 Generic system peripheral [0807]: Intel Corporation Atom Processor C3000 Series Root Complex Event Collector (rev 11)
00:06.0 PCI bridge: Intel Corporation Atom Processor C3000 Series Integrated QAT Root Port (rev 11)
00:09.0 PCI bridge: Intel Corporation Atom Processor C3000 Series PCI Express Root Port #0 (rev 11)
00:0b.0 PCI bridge: Intel Corporation Atom Processor C3000 Series PCI Express Root Port #2 (rev 11)
00:0e.0 PCI bridge: Intel Corporation Atom Processor C3000 Series PCI Express Root Port #4 (rev 11)
00:0f.0 PCI bridge: Intel Corporation Atom Processor C3000 Series PCI Express Root Port #5 (rev 11)
00:12.0 System peripheral: Intel Corporation Atom Processor C3000 Series SMBus Contoller - Host (rev 11)
00:14.0 SATA controller: Intel Corporation Atom Processor C3000 Series SATA Controller 1 (rev 11)
00:15.0 USB controller: Intel Corporation Atom Processor C3000 Series USB 3.0 xHCI Controller (rev 11)
00:16.0 PCI bridge: Intel Corporation Atom Processor C3000 Series Integrated LAN Root Port #0 (rev 11)
00:17.0 PCI bridge: Intel Corporation Atom Processor C3000 Series Integrated LAN Root Port #1 (rev 11)
00:18.0 Communication controller: Intel Corporation Atom Processor C3000 Series ME HECI 1 (rev 11)
00:1f.0 ISA bridge: Intel Corporation Atom Processor C3000 Series LPC or eSPI (rev 11)
00:1f.2 Memory controller: Intel Corporation Atom Processor C3000 Series Power Management Controller (rev 11)
00:1f.4 SMBus: Intel Corporation Atom Processor C3000 Series SMBus controller (rev 11)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Atom Processor C3000 Series SPI Controller (rev 11)
01:00.0 Co-processor: Intel Corporation Atom Processor C3000 Series QuickAssist Technology (rev 11)
04:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
04:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
04:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
04:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
05:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
06:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
07:00.0 Ethernet controller: Intel Corporation Ethernet Connection X553/X557-AT 10GBASE-T (rev 11)
07:00.1 Ethernet controller: Intel Corporation Ethernet Connection X553/X557-AT 10GBASE-T (rev 11)
08:00.0 Ethernet controller: Intel Corporation Ethernet Connection X553 10 GbE SFP+ (rev 11)
08:00.1 Ethernet controller: Intel Corporation Ethernet Connection X553 10 GbE SFP+ (rev 11)