[PATCH v2 0/2] New CMCI storm mitigation for Intel CPUs

From: Tony Luck
Date: Tue Mar 15 2022 - 14:15:20 EST


Two-part motivation:

1) Disabling CMCI globally is an overly big hammer

2) Intel signals some UNCORRECTED errors using CMCI (yes, turns
out that was a poorly chosen name given the later evolution of
the architecture). Since we don't want to miss those, the proposed
storm code just bumps the threshold to (almost) maximum to mitigate,
but not eliminate, the storm. Note that the threshold only applies
to corrected errors.

Patch 1 deletes the parts of the old storm code that are no
longer needed.

Patch 2 adds the new per-bank mitigation.
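
For readers who have not looked at the threshold mechanism, here is a
rough user-space sketch of the idea behind the per-bank mitigation. It
is not code from the patches. The field layout follows my reading of
the SDM description of MCi_CTL2 (bits 14:0 are the corrected error
count threshold, bit 30 is the CMCI enable). The function name and both
threshold values are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MCi_CTL2 fields as described in the SDM (assumed here, not copied from the patch) */
#define CTL2_THRESHOLD_MASK	0x7fffULL	/* bits 14:0: corrected error count threshold */
#define CTL2_CMCI_EN		(1ULL << 30)	/* bit 30: CMCI enable */

/* Hypothetical values chosen for this sketch only */
#define SKETCH_NORMAL_THRESHOLD	1	/* interrupt on every corrected error */
#define SKETCH_STORM_THRESHOLD	32000	/* arbitrary value just below the 15-bit maximum */

/*
 * Compute the MCi_CTL2 value for one bank. Only the threshold field
 * changes; CMCI_EN is left alone so that uncorrected errors signaled
 * via CMCI are still delivered while the bank is in storm mode.
 */
static uint64_t sketch_ctl2_for_state(uint64_t ctl2, bool storm)
{
	uint64_t thr = storm ? SKETCH_STORM_THRESHOLD : SKETCH_NORMAL_THRESHOLD;

	assert(thr <= CTL2_THRESHOLD_MASK);
	ctl2 &= ~CTL2_THRESHOLD_MASK;
	return ctl2 | thr;
}
```

The point of the sketch is that CMCI stays enabled on the noisy bank
and only the corrected error count needed to raise an interrupt goes
up, so other banks are not affected.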

Smita: Unless Boris finds some more stuff for me to fix, this
version will be a better starting point to merge with your changes.

Changes since v1 (based on feedback from Boris):

- Spelling fixes in commit message
- Many more comments explaining what is going on
- Change name of function that does tracking
- Change names for #defines for storm BEGIN/END
- #define for high threshold in decimal, not hex

Tony Luck (2):
x86/mce: Remove old CMCI storm mitigation code
x86/mce: Add per-bank CMCI storm mitigation

arch/x86/kernel/cpu/mce/core.c | 46 +++---
arch/x86/kernel/cpu/mce/intel.c | 241 ++++++++++++++---------------
arch/x86/kernel/cpu/mce/internal.h | 10 +-
3 files changed, 141 insertions(+), 156 deletions(-)


base-commit: ffb217a13a2eaf6d5bd974fc83036a53ca69f1e2
--
2.35.1