[PATCH] EDAC/amd64: Add MI300 row retirement support

From: Yazen Ghannam
Date: Sun Feb 04 2024 - 10:51:49 EST


AMD MI300 systems have on-die High Bandwidth Memory. This memory has a
relatively high error rate and, unlike DIMMs, it is not individually
replaceable.

Uncorrectable ECC errors are individually reported as Deferred errors
using the AMD Deferred error interrupt. Each reported error corresponds
to a single hardware error.

Correctable ECC errors may be reported in batches through MCA Thresholding.
Users can configure the threshold limit based on their policy. Each
reported Correctable error represents a single occurrence of the
threshold limit being reached.

The current guidance from AMD designers is that memory affected by ECC
errors within a DRAM row should be retired. Action should be taken on
every reported ECC error.

Add a helper function to apply this policy for MI300 systems.

This and similar functionality may be best handled in a separate,
generic module. In the meantime, do this in AMD64 EDAC for simplicity.

Signed-off-by: Yazen Ghannam <yazen.ghannam@xxxxxxx>
---
Notes:

This is a complete rewrite of the following patch:
https://lore.kernel.org/r/20231129073521.2127403-7-muralimk@xxxxxxx

I'd like to include Murali as co-developer, since this is based on his
work.

The remaining MI300 RAS work will be focused on saving and restoring bad
memory information across reboots. The latest set on the mailing list is
here:
https://lore.kernel.org/r/20231129075034.2159223-1-muralimk@xxxxxxx
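
For reviewers who want to see the column walk in isolation, here is a
minimal userspace sketch of the permutation step only. It assumes, as in
the patch, that the column field occupies bits 5:1 of the UMC MCA address.
The names COL_SHIFT, COL_BITS, COL_MASK, NUM_COL and
row_col_permutations() are hypothetical stand-ins for illustration; they
are not kernel symbols, and the translation to a system physical address
and the call to memory_failure() are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace stand-ins for GENMASK(5, 1) and
 * BIT(HWEIGHT(mask)) as used in the patch. Not the kernel macros.
 */
#define COL_SHIFT 1u
#define COL_BITS  5u
#define COL_MASK  (((UINT64_C(1) << COL_BITS) - 1) << COL_SHIFT) /* bits 5:1 */
#define NUM_COL   (1u << COL_BITS)                               /* 32 values */

/*
 * Fill out[] with every address that shares addr's non-column bits,
 * one entry per column value. Returns the number of entries written.
 * The caller must supply room for at least NUM_COL entries.
 */
static size_t row_col_permutations(uint64_t addr, uint64_t *out, size_t cap)
{
	unsigned int col;

	assert(cap >= NUM_COL);

	for (col = 0; col < NUM_COL; col++) {
		/* Clear the column field, then insert this column value. */
		out[col] = (addr & ~COL_MASK) | ((uint64_t)col << COL_SHIFT);
	}

	return NUM_COL;
}
```

In the patch itself each of these addresses is then passed through
amd_convert_umc_mca_addr_to_sys_addr(), converted to a PFN, and handed to
memory_failure() unless the page is offline or already poisoned.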

drivers/edac/Kconfig | 1 +
drivers/edac/amd64_edac.c | 48 +++++++++++++++++++++++++++++++++++++++
2 files changed, 49 insertions(+)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 16c8de5050e5..8b147403c955 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -78,6 +78,7 @@ config EDAC_GHES
config EDAC_AMD64
tristate "AMD64 (Opteron, Athlon64)"
depends on AMD_NB && EDAC_DECODE_MCE
+ depends on MEMORY_FAILURE
imply AMD_ATL
help
Support for error detection and correction of DRAM ECC errors on
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index ca9a8641652d..ee2f3ff15ab7 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2795,6 +2795,51 @@ static void umc_get_err_info(struct mce *m, struct err_info *err)
err->csrow = m->synd & 0x7;
}

+/*
+ * When a DRAM ECC error occurs on MI300 systems, it is recommended to retire
+ * all memory within that DRAM row. This applies to the memory within a DRAM
+ * bank.
+ *
+ * To find the memory addresses, loop through permutations of the DRAM column
+ * bits and find the System Physical address of each. The column bits are used
+ * to calculate the intermediate Normalized address, so all permutations should
+ * be checked.
+ *
+ * See amd_atl::convert_dram_to_norm_addr_mi300() for MI300 address formats.
+ */
+#define MI300_UMC_MCA_COL GENMASK(5, 1)
+#define MI300_NUM_COL BIT(HWEIGHT(MI300_UMC_MCA_COL))
+static void retire_row_mi300(struct atl_err *a_err)
+{
+ unsigned long addr;
+ struct page *p;
+ u8 col;
+
+ for (col = 0; col < MI300_NUM_COL; col++) {
+ a_err->addr &= ~MI300_UMC_MCA_COL;
+ a_err->addr |= FIELD_PREP(MI300_UMC_MCA_COL, col);
+
+ addr = amd_convert_umc_mca_addr_to_sys_addr(a_err);
+ if (IS_ERR_VALUE(addr))
+ continue;
+
+ addr = PHYS_PFN(addr);
+
+ /*
+ * Skip invalid or already poisoned pages to avoid unnecessary
+ * error messages from memory_failure().
+ */
+ p = pfn_to_online_page(addr);
+ if (!p)
+ continue;
+
+ if (PageHWPoison(p))
+ continue;
+
+ memory_failure(addr, 0);
+ }
+}
+
static void decode_umc_error(int node_id, struct mce *m)
{
u8 ecc_type = (m->status >> 45) & 0x3;
@@ -2845,6 +2890,9 @@ static void decode_umc_error(int node_id, struct mce *m)

error_address_to_page_and_offset(sys_addr, &err);

+ if (pvt->fam == 0x19 && pvt->dram_type == MEM_HBM3)
+ retire_row_mi300(&a_err);
+
log_error:
__log_ecc_error(mci, &err, ecc_type);
}
--
2.34.1