[PATCH 0/4] rasdaemon: cxl: Add support for memory repair operations

From: shiju.jose
Date: Fri Feb 07 2025 - 09:32:37 EST


From: Shiju Jose <shiju.jose@xxxxxxxxxx>

CXL devices provide error records for both corrected and uncorrectable
memory errors. These errors may reflect a one-off corruption event
(no increased likelihood of repeat) or a hardware problem
(more likely to repeat). Many factors go into predicting which case
applies. This patch set focuses on one particular case, in which the
device itself judges that a repeated problem is likely and suggests that
the OS take remedial action.

CXL spec 3.1, Section 8.2.9.2.1, Table 8-43 ("Common Event Record Format")
defines the Event Record Flags field, which includes the 'Maintenance
Needed' flag indicating whether the memory device requires maintenance.
The CXL DRAM and general media event handlers export to userspace (via a
tracepoint) the attributes needed for memory sparing or PPR (Post Package
Repair). These attributes can then be written back to the EDAC memory
repair sysfs interface to initiate the sparing/PPR operation in the CXL
memory device.
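
The write-back step can be sketched roughly as below. This is a minimal
illustration, not the code from this series: write_repair_attrs is a
hypothetical helper, and the attribute file names (for example dpa,
nibble_mask and the final repair trigger) are assumed from the proposed EDAC
memory repair interface [1]. The exact set depends on the repair type and on
the kernel version.

```shell
#!/bin/sh
# Hypothetical helper: write repair attributes into a memory repair
# sysfs directory, then trigger the operation.
#   $1    : repair directory (assumed to be an EDAC mem_repairX directory)
#   $2... : name=value pairs, one per attribute file
write_repair_attrs() {
    dir=$1
    shift
    for kv in "$@"; do
        # Reject arguments that are not of the form name=value.
        case "$kv" in
            *=*) ;;
            *) return 2 ;;
        esac
        name=${kv%%=*}
        value=${kv#*=}
        printf '%s\n' "$value" > "$dir/$name" || return 1
    done
    # Assumption: writing 1 to 'repair' starts the sparing/PPR operation.
    printf '1\n' > "$dir/repair"
}
```

A call might look like
`write_repair_attrs /sys/bus/edac/devices/cxl_mem0/mem_repair0 dpa=0x1000 nibble_mask=0xf`,
where both the path and the values are illustrative assumptions.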

This series enables rasdaemon to close the loop and perform live
memory sparing and PPR operations.

Rasdaemon supports live memory repair for reported CXL DRAM errors that
have the 'maintenance needed' flag set. However, the kernel CXL driver
rejects a live memory repair request in the following situations:
1. Memory is online and the repair is disruptive.
2. Memory is online and the event record does not match.
In addition, live memory repair is not requested if the auto repair option
is switched off for rasdaemon.
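
The resulting decision flow can be summarised in a small sketch. The function
name repair_outcome and its arguments are hypothetical and only illustrate the
cases listed above; the real logic lives in ras-cxl-handler.c and in the
kernel CXL driver.

```shell
#!/bin/sh
# Hypothetical summary of the outcome for one CXL DRAM event.
#   $1 : 'maintenance needed' flag from the event record (1 = set)
#   $2 : rasdaemon auto repair option (1 = enabled)
#   $3 : result of the live repair request ("ok" or "rejected");
#        only consulted when a live repair is actually requested
repair_outcome() {
    maint_needed=$1
    auto_repair=$2
    live_result=$3

    if [ "$maint_needed" != "1" ]; then
        # The device did not ask for maintenance: nothing to do.
        echo "none"
    elif [ "$auto_repair" != "1" ]; then
        # Auto repair is off: no live request, record for the next boot.
        echo "deferred"
    elif [ "$live_result" = "ok" ]; then
        echo "repaired"
    else
        # The kernel rejected the request (for example, memory online and
        # repair disruptive): record for the next boot.
        echo "deferred"
    fi
}
```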

In the above unrepaired cases, repair-needed information for CXL DRAM
events must be stored in the CXL DRAM event record of the SQLite database.
This allows a boot-up script to read repair status and repair attributes
in the next boot. If the memory has not been repaired, the script will
issue the memory repair operation requested by the memory device in the
previous boot. The kernel CXL driver sends a repair command to the device
if the memory to be repaired is offline.
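
As a rough sketch of the boot-time selection step, assuming the unrepaired
records have already been dumped from the SQLite database as '|'-separated
rows: the row layout (id|memdev|dpa|repair_needed) and the function name
pending_repairs are hypothetical and do not reflect the actual schema or the
actual util/cxl-mem-repair.sh.

```shell
#!/bin/sh
# Hypothetical filter: read rows "id|memdev|dpa|repair_needed" on stdin
# and print "memdev dpa" for every row still marked as needing repair.
pending_repairs() {
    while IFS='|' read -r id memdev dpa needed; do
        if [ "$needed" = "1" ]; then
            printf '%s %s\n' "$memdev" "$dpa"
        fi
    done
    return 0
}
```

The input could come from the sqlite3 command-line tool, whose default list
output uses '|' as the column separator; the query and column names would have
to match the schema that patch 3 of this series introduces.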

Add a CXL memory repair boot-up script to handle the unrepaired
CXL DRAM errors from the previous boot.

Notes:
1. The series implements the userspace code for CXL memory repair using the
proposed EDAC memory repair interface [1].

2. The code is based on v2 of "rasdaemon: cxl: Update CXL event logging and
recording to CXL spec rev 3.1" [2].

[1] https://lore.kernel.org/linux-cxl/20250106121017.1620-1-shiju.jose@xxxxxxxxxx/T/#maf191b2a104591f993da00249e67bd483ab67ce0
[2] https://lore.kernel.org/lkml/20250110122641.1668-1-shiju.jose@xxxxxxxxxx/

Shiju Jose (4):
  rasdaemon: cxl: Add support for memory sparing operation
  rasdaemon: cxl: Add support for memory soft PPR operation
  rasdaemon: cxl: Add storing memory repair needed info in the DRAM
    event record
  rasdaemon: cxl: Add CXL memory repair boot-up script for unrepaired
    memory errors

misc/rasdaemon.env | 4 +
ras-cxl-handler.c | 386 +++++++++++++++++++++++++++++++++++++++++
ras-record.c | 2 +
ras-record.h | 1 +
util/cxl-mem-repair.sh | 189 ++++++++++++++++++++
5 files changed, 582 insertions(+)
create mode 100755 util/cxl-mem-repair.sh

--
2.43.0