Re: [PATCH] efi/cper: Add NVIDIA CPER section support

From: Kai-Heng Feng

Date: Tue Feb 24 2026 - 23:53:18 EST


Hi Shiju,

On 2026/2/24 7:23 PM, Shiju Jose wrote:


-----Original Message-----
From: Kai-Heng Feng <kaihengf@xxxxxxxxxx>
Sent: 23 February 2026 06:49
To: ardb@xxxxxxxxxx
Cc: Kai-Heng Feng <kaihengf@xxxxxxxxxx>; Rafael J. Wysocki
<rafael@xxxxxxxxxx>; Tony Luck <tony.luck@xxxxxxxxx>; Borislav Petkov
<bp@xxxxxxxxx>; Guohanjun (Hanjun Guo) <guohanjun@xxxxxxxxxx>; Mauro
Carvalho Chehab <mchehab@xxxxxxxxxx>; Shuai Xue
<xueshuai@xxxxxxxxxxxxxxxxx>; Jonathan Cameron
<jonathan.cameron@xxxxxxxxxx>; Morduan Zang
<zhangdandan@xxxxxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
linux-efi@xxxxxxxxxxxxxxx; linux-acpi@xxxxxxxxxxxxxxx
Subject: [PATCH] efi/cper: Add NVIDIA CPER section support

Add support for decoding NVIDIA-specific error sections in UEFI CPER records.
NVIDIA hardware generates vendor-specific CPER sections containing error
signatures and diagnostic register dumps. This implementation decodes these
sections and prints error details to the kernel log.

The NVIDIA CPER section contains a fixed header with error metadata (signature,
error type, severity, socket) followed by variable-length register address-value
pairs for hardware diagnostics.

This work is based on libcper [0].

Example output:
Hardware error from APEI Generic Hardware Error Source: 816
event severity: info
imprecise tstamp: 2025-11-17 07:57:38
Error 0, type: info
section_type: NVIDIA, error_data_length: 224
signature: HSS-IDLE
error_type: 0
error_instance: 0
severity: 0
socket: 255
number_regs: 12
instance_base: 0x0000000000000000
register[0]: address=0x0000000004f10008 value=0x0000000000002019
register[1]: address=0x0000000000000000 value=0x0000000000000000
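
For illustration only, here is a minimal userspace sketch of the layout described above: a fixed header followed by number_regs address-value pairs. The struct and function names (nv_sec_hdr, nv_reg, nv_sec_expected_len, nv_sec_print) are hypothetical and are not taken from the patch. The field widths are an assumption inferred from the example output, where a 32-byte header plus 12 registers of 16 bytes each gives the reported error_data_length of 224. A little-endian host is also assumed.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical layout; field widths are assumed, not copied from the patch. */
struct nv_reg {
	uint64_t address;
	uint64_t value;
};

struct nv_sec_hdr {
	char     signature[16];
	uint16_t error_type;
	uint16_t error_instance;
	uint8_t  severity;
	uint8_t  socket;
	uint8_t  number_regs;
	uint8_t  reserved;
	uint64_t instance_base;
};

/* 16 + 2 + 2 + 1 + 1 + 1 + 1 + 8 = 32 bytes with natural alignment. */
static_assert(sizeof(struct nv_sec_hdr) == 32, "unexpected header size");

/* Length a section carrying number_regs register pairs should have. */
size_t nv_sec_expected_len(uint8_t number_regs)
{
	return sizeof(struct nv_sec_hdr) +
	       (size_t)number_regs * sizeof(struct nv_reg);
}

/*
 * Print a section after checking that the buffer is long enough for the
 * header and for every register the header claims to carry.
 * Returns the number of registers printed, or -1 if the buffer is too short.
 */
int nv_sec_print(const uint8_t *buf, size_t len)
{
	struct nv_sec_hdr hdr;
	struct nv_reg reg;
	unsigned int i;

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));

	if (len < nv_sec_expected_len(hdr.number_regs))
		return -1;

	/* The signature may not be NUL-terminated, so bound the print. */
	printf("signature: %.16s\n", hdr.signature);
	printf("error_type: %u\n", hdr.error_type);
	printf("error_instance: %u\n", hdr.error_instance);
	printf("severity: %u\n", hdr.severity);
	printf("socket: %u\n", hdr.socket);
	printf("number_regs: %u\n", hdr.number_regs);
	printf("instance_base: 0x%016" PRIx64 "\n", hdr.instance_base);

	for (i = 0; i < hdr.number_regs; i++) {
		memcpy(&reg, buf + sizeof(hdr) + i * sizeof(reg), sizeof(reg));
		printf("register[%u]: address=0x%016" PRIx64
		       " value=0x%016" PRIx64 "\n",
		       i, reg.address, reg.value);
	}
	return hdr.number_regs;
}
```

The length check against number_regs is the important part of the sketch: the register count comes from firmware-supplied data, so it should be validated against the section length before any register is read.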

[0] https://github.com/openbmc/libcper/commit/683e055061ce
Signed-off-by: Kai-Heng Feng <kaihengf@xxxxxxxxxx>
---
drivers/firmware/efi/Kconfig | 16 ++++++
drivers/firmware/efi/Makefile | 1 +
drivers/firmware/efi/cper-nvidia.c | 79 ++++++++++++++++++++++++++++++
drivers/firmware/efi/cper-nvidia.h | 33 +++++++++++++
drivers/firmware/efi/cper.c | 3 ++
include/linux/cper.h | 4 ++
6 files changed, 136 insertions(+)
create mode 100644 drivers/firmware/efi/cper-nvidia.c
create mode 100644 drivers/firmware/efi/cper-nvidia.h

diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index 29e0729299f5..ed1f53b8e878 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -329,6 +329,22 @@ config UEFI_CPER_X86
depends on UEFI_CPER && X86
default y

+config UEFI_CPER_NVIDIA
+ bool "UEFI CPER NVIDIA support"
+ depends on UEFI_CPER
+ help
+ This option enables support for decoding NVIDIA-specific error
+ sections in UEFI Common Platform Error Records (CPER). These
+ sections contain additional diagnostic information for errors
+ occurring in NVIDIA hardware such as GPUs, switches, and other
+ devices.
+
+ The NVIDIA CPER sections include error signatures (e.g., PCIe-DPC,
+ DCC-ECC, GPU-STATUS) and diagnostic registers that provide detailed
+ information about hardware errors for debugging and analysis.
+
+ If unsure, say N.
+
config TEE_STMM_EFI
tristate "TEE-based EFI runtime variable service driver"
depends on EFI && OPTEE
diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
index 8efbcf699e4f..a571b6086860 100644
--- a/drivers/firmware/efi/Makefile
+++ b/drivers/firmware/efi/Makefile
@@ -42,5 +42,6 @@ obj-$(CONFIG_EFI_CAPSULE_LOADER) += capsule-loader.o
obj-$(CONFIG_EFI_EARLYCON) += earlycon.o
obj-$(CONFIG_UEFI_CPER_ARM) += cper-arm.o
obj-$(CONFIG_UEFI_CPER_X86) += cper-x86.o
+obj-$(CONFIG_UEFI_CPER_NVIDIA) += cper-nvidia.o

Hi,

Is drivers/firmware/efi/cper.c the right place to log vendor-specific errors,
given that so far drivers/firmware/efi/ only logs CPER information defined by the standards?
Vendor-specific errors are currently logged and recorded in rasdaemon.
https://github.com/mchehab/rasdaemon
https://github.com/mchehab/rasdaemon/blob/master/ras-non-standard-handler.c#L52

If some kernel-level recovery action or logging is required, we can also register with
acpi/apei/ghes using ghes_register_vendor_record_notifier() to receive a callback.
https://elixir.bootlin.com/linux/v6.19.3/source/drivers/acpi/apei/ghes.c#L652

Thank you for the info. There is indeed an ACPI node for CPER purposes. I'll check whether its ACPI HID can be used to implement this with ghes_register_vendor_record_notifier().

Kai-Heng


[...]
+/* NVIDIA Error Section */
+#define CPER_SEC_NVIDIA \
+ GUID_INIT(0x6d5244f2, 0x2712, 0x11ec, 0xbe, 0xa7, 0xcb, 0x3f, \
+ 0xdb, 0x95, 0xc7, 0x86)

#define CPER_PROC_VALID_TYPE 0x0001
#define CPER_PROC_VALID_ISA 0x0002
--
2.43.0


Thanks,
Shiju