Re: [PATCH 2/2] RAS/AMD/ATL: Translate normalized to system physical addresses using PRM

From: Yazen Ghannam
Date: Sun Apr 07 2024 - 10:17:41 EST

On 3/26/24 17:26, John Allen wrote:
Future AMD platforms will provide a UEFI PRM module that implements a
number of address translation PRM handlers. This will provide an
interface for the OS to call platform specific code without requiring
the use of SMM or other heavy firmware operations.

AMD Zen-based systems report memory error addresses through Machine
Check banks representing Unified Memory Controllers (UMCs) in the form
of UMC relative "normalized" addresses. A normalized address must be
converted to a system physical address to be usable by the OS.

Add support for the normalized to system physical address translation
PRM handler in the AMD Address Translation Library and prefer it over
native code if available. The GUID and parameter buffer structure are
specific to the normalized to system physical address handler provided
by the address translation PRM module included in future AMD systems.

The address translation PRM module is documented in chapter 22 of the
publicly available "AMD Family 1Ah Models 00h–0Fh and Models 10h–1Fh
ACPI v6.5 Porting Guide":
https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/58088-0.75-pub.pdf

Signed-off-by: John Allen <john.allen@xxxxxxx>
---
drivers/ras/amd/atl/Makefile | 1 +
drivers/ras/amd/atl/internal.h | 2 ++
drivers/ras/amd/atl/prm.c | 61 ++++++++++++++++++++++++++++++++++
drivers/ras/amd/atl/umc.c | 5 +++
4 files changed, 69 insertions(+)
create mode 100644 drivers/ras/amd/atl/prm.c

diff --git a/drivers/ras/amd/atl/Makefile b/drivers/ras/amd/atl/Makefile
index 4acd5f05bd9c..8f1afa793e3b 100644
--- a/drivers/ras/amd/atl/Makefile
+++ b/drivers/ras/amd/atl/Makefile
@@ -14,5 +14,6 @@ amd_atl-y += denormalize.o
amd_atl-y += map.o
amd_atl-y += system.o
amd_atl-y += umc.o
+amd_atl-y += prm.o
obj-$(CONFIG_AMD_ATL) += amd_atl.o
diff --git a/drivers/ras/amd/atl/internal.h b/drivers/ras/amd/atl/internal.h
index 5de69e0bb0f9..f739dcada126 100644
--- a/drivers/ras/amd/atl/internal.h
+++ b/drivers/ras/amd/atl/internal.h
@@ -234,6 +234,8 @@ int dehash_address(struct addr_ctx *ctx);
unsigned long norm_to_sys_addr(u8 socket_id, u8 die_id, u8 coh_st_inst_id, unsigned long addr);
unsigned long convert_umc_mca_addr_to_sys_addr(struct atl_err *err);
+unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 umc_bank_inst_id, unsigned long addr);
+
/*
* Make a gap in @data that is @num_bits long starting at @bit_num.
* e.g. data = 11111111'b
diff --git a/drivers/ras/amd/atl/prm.c b/drivers/ras/amd/atl/prm.c
new file mode 100644
index 000000000000..54a69e660eb5
--- /dev/null
+++ b/drivers/ras/amd/atl/prm.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * AMD Address Translation Library
+ *
+ * prm.c : Plumbing code to UEFI Platform Runtime Mechanism (PRM)
+ *
+ * Copyright (c) 2024, Advanced Micro Devices, Inc.
+ * All Rights Reserved.
+ *
+ * Author: John Allen <john.allen@xxxxxxx>
+ */
+
+#include "internal.h"
+
+#if defined(CONFIG_ACPI_PRMT)
+
+#include <linux/prmt.h>
+
+struct prm_umc_param_buffer_norm {
+ u64 norm_addr;
+ u8 socket;
+ u64 umc_bank_inst_id;
+ void *output_buffer;
+} __packed;
+
+const guid_t norm_to_sys_prm_handler_guid = GUID_INIT(0xE7180659, 0xA65D,

Use the static keyword since this is only used in the current file.

+ 0x451D, 0x92, 0xCD,
+ 0x2B, 0x56, 0xF1, 0x2B,
+ 0xEB, 0xA6);
+
+unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 umc_bank_inst_id, unsigned long addr)
+{
+ struct prm_umc_param_buffer_norm param_buffer;
+ unsigned long ret_addr;
+ int ret;
+
+ param_buffer.norm_addr = addr;
+ param_buffer.socket = socket_id;
+ param_buffer.umc_bank_inst_id = umc_bank_inst_id;
+ param_buffer.output_buffer = &ret_addr;
+
+ ret = acpi_call_prm_handler(norm_to_sys_prm_handler_guid, &param_buffer);
+ if (!ret)
+ return ret_addr;
+
+ if (ret == -ENODEV)
+ pr_info("PRM module/handler not available\n");

Make this a pr_debug(). I don't think this is something a user could do
anything about. And one goal of this library is to abstract how the
functions work. So "trying different backends" is a library developer
concern.

+ else
+ pr_info("PRM address translation failed\n");

Make this a pr_notice_once().

If the handler is available and fails, then this is likely a bug. It
should be reported to the system vendor. And it may be possible for the
user to update the PRM handler. This could be through a BIOS update or
the runtime update option for PRM.

Aside: is the runtime update option implemented?

"Notice" is between info and warning. I think we'd want the user to
notice, but this isn't severe enough to need a warning.

Also, *_once() will prevent duplicate messages in the case of multiple
memory errors in the system. The handler shouldn't fail on any valid
input, so a single notice is enough. Especially if the message doesn't
have any error/context-specific details.

Another aside: it's possible to have invalid input. This can happen in
"software/simulated" MCA errors, i.e. the user provides an arbitrary
value for MCA_ADDR. But this would be a user error. I don't think it's
worth trying to filter out this case. An expert user could provide valid
inputs, and they may want to test the full flow. And this isn't an issue
just for PRM but for the ATL overall. I hit this myself while testing
another feature. I used a signature for MCA_ADDR (0xC001C0DE01ABCDEF ?)
and the translation failed. But I was more interested in the signature
than the real value. :)
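
To make the two logging suggestions concrete, here is a minimal userspace
sketch. It is not kernel code: pr_debug() and pr_notice_once() are simplified
stand-ins written for this example, and report_prm_result() and notice_count
are hypothetical names that exist only here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Counts how many notices were actually emitted (illustration only). */
static int notice_count;

/* Stand-in: behaves like pr_debug() in a build without DEBUG, i.e. silent. */
#define pr_debug(...) do { } while (0)

/* Stand-in: prints at most once per call site, like the *_once() helpers. */
#define pr_notice_once(...)				\
	do {						\
		static bool printed;			\
		if (!printed) {				\
			printed = true;			\
			notice_count++;			\
			fprintf(stderr, __VA_ARGS__);	\
		}					\
	} while (0)

/* Hypothetical helper: log a PRM handler return code as suggested above. */
static void report_prm_result(int ret)
{
	if (!ret)
		return;

	if (ret == -ENODEV)
		/* Backend missing: only interesting to library developers. */
		pr_debug("PRM module/handler not available\n");
	else
		/* Handler present but failed: likely a firmware bug. */
		pr_notice_once("PRM address translation failed\n");
}
```

With this sketch, any number of -ENODEV results stays silent, and repeated
handler failures produce a single notice.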

+
+ return ret;
+}
+
+#else /* ACPI_PRMT */
+
+unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 umc_bank_inst_id, unsigned long addr)
+{
+ return -ENODEV;
+}
+
+#endif
diff --git a/drivers/ras/amd/atl/umc.c b/drivers/ras/amd/atl/umc.c
index 59b6169093f7..954cbe6bf465 100644
--- a/drivers/ras/amd/atl/umc.c
+++ b/drivers/ras/amd/atl/umc.c
@@ -333,9 +333,14 @@ unsigned long convert_umc_mca_addr_to_sys_addr(struct atl_err *err)
u8 coh_st_inst_id = get_coh_st_inst_id(err);
unsigned long addr = get_addr(err->addr);
u8 die_id = get_die_id(err);
+ unsigned long ret_addr;
pr_debug("socket_id=0x%x die_id=0x%x coh_st_inst_id=0x%x addr=0x%016lx",
socket_id, die_id, coh_st_inst_id, addr);
+ ret_addr = prm_umc_norm_to_sys_addr(socket_id, err->ipid, addr);
+ if (!IS_ERR_VALUE(ret_addr))
+ return ret_addr;
+
return norm_to_sys_addr(socket_id, die_id, coh_st_inst_id, addr);
}
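
For reference, the fallback above relies on the negative int error code from
the PRM path surviving the conversion to unsigned long so that IS_ERR_VALUE()
catches it. A minimal userspace sketch of that convention follows. MAX_ERRNO
and IS_ERR_VALUE() are re-created here in simplified form, and
prm_translate(), native_translate(), translate() and the address offsets are
hypothetical placeholders, not the ATL functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simplified userspace re-creation of the kernel's error-value check. */
#define MAX_ERRNO 4095
#define IS_ERR_VALUE(x) ((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

/* Hypothetical PRM path: succeeds only when prm_ok is true. */
static unsigned long prm_translate(bool prm_ok, unsigned long addr)
{
	int ret = prm_ok ? 0 : -ENODEV;

	if (!ret)
		return addr + 0x1000;	/* placeholder "translation" */

	/* The negative int converts to a value in the top MAX_ERRNO range. */
	return ret;
}

/* Hypothetical native path: always succeeds in this sketch. */
static unsigned long native_translate(unsigned long addr)
{
	return addr + 0x2000;		/* placeholder "translation" */
}

/* Prefer the PRM result; fall back to native code on any error value. */
static unsigned long translate(bool prm_ok, unsigned long addr)
{
	unsigned long ret_addr = prm_translate(prm_ok, addr);

	if (!IS_ERR_VALUE(ret_addr))
		return ret_addr;

	return native_translate(addr);
}
```

The check treats only the top 4095 values of the unsigned long range as
errors, so any realistic system physical address passes through unchanged.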

Thanks,
Yazen