Re: [PATCH v13 03/18] EDAC: Add ECS control feature

From: Jonathan Cameron
Date: Mon Oct 14 2024 - 10:33:35 EST


On Wed, 9 Oct 2024 13:41:04 +0100
<shiju.jose@xxxxxxxxxx> wrote:

> From: Shiju Jose <shiju.jose@xxxxxxxxxx>
>
> Add EDAC ECS (Error Check Scrub) control in order to control a memory
> device's ECS feature.
>
> The Error Check Scrub (ECS) is a feature defined in JEDEC DDR5 SDRAM
> Specification (JESD79-5) and allows the DRAM to internally read, correct
> single-bit errors, and write back corrected data bits to the DRAM array
> while providing transparency to error counts.
>
> The DDR5 device contains number of memory media FRUs per device. The
> DDR5 ECS feature and thus the ECS control driver supports configuring
> the ECS parameters per FRU.
>
> The memory devices support ECS feature register with EDAC device
> driver, which retrieves the ECS descriptor from EDAC ECS driver and
> exposes the sysfs ECS control attributes to userspace in
> /sys/bus/edac/devices/<dev-name>/ecs_fruX/.
>
> The common sysfs ECS control interface abstracts the control of an
> arbitrary ECS functionality to a common set of functions.
>
> The support for ECS feature is added separately because the DDR5 ECS
> features control attributes are dissimilar from those of the scrub
> feature.
>
> The sysfs ECS attr nodes would be present only if the client driver
> has implemented the corresponding attr callback function and passed
> in ops to the EDAC RAS feature driver during registration.
>
> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
Hi Shiju

A few minor bits and bobs inline.

> +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/mode_counts_codewords
> +Date: Oct 2024
> +KernelVersion: 6.12

These all need updating to 6.13 given we missed on 6.12.

> +Contact: linux-edac@xxxxxxxxxxxxxxx
> +Description:
> + (RO) True if current mode is ECS counts codewords with errors.
> +
> +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/reset
> +Date: Oct 2024
> +KernelVersion: 6.12
> +Contact: linux-edac@xxxxxxxxxxxxxxx
> +Description:
> + (WO) ECS reset ECC counter.
> + 0 - normal, ECC counter running actively.

For a write only parameter, maybe just reject anything that isn't 1?

> + 1 - reset ECC counter to the default value.
> +
> +What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/threshold
> +Date: Oct 2024
> +KernelVersion: 6.12
> +Contact: linux-edac@xxxxxxxxxxxxxxx
> +Description:
> + (RW) ECS threshold count per GB of memory cells.

In the CXL spec it is Gb of memory cells (bits, not bytes).
I'm assuming this is meant to match that.
Maybe safer to spell it out.


> diff --git a/drivers/edac/ecs.c b/drivers/edac/ecs.c
> new file mode 100755
> index 000000000000..a2b64d7bf6b6
> --- /dev/null
> +++ b/drivers/edac/ecs.c


> diff --git a/include/linux/edac.h b/include/linux/edac.h
> index 5344e2cf6808..20bdb08c7626 100644
> --- a/include/linux/edac.h
> +++ b/include/linux/edac.h

>
> +/**
> + * struct ecs_ops - ECS device operations (all elements optional)
> + * @get_log_entry_type: read the log entry type value.
> + * @set_log_entry_type: set the log entry type value.
> + * @get_log_entry_type_per_dram: read the log entry type per dram value.
> + * @get_log_entry_type_memory_media: read the log entry type per memory media value.
> + * @get_mode: read the mode value.
> + * @set_mode: set the mode value.
> + * @get_mode_counts_rows: read the mode counts rows value.
> + * @get_mode_counts_codewords: read the mode counts codewords value.
> + * @reset: reset the ECS counter.
> + * @get_threshold: read the threshold value.
> + * @set_threshold: set the threshold value.

Maybe it's worth duplicating the ABI docs statement on units?

> + */
> +struct edac_ecs_ops {
> + int (*get_log_entry_type)(struct device *dev, void *drv_data, int fru_id, u32 *val);
> + int (*set_log_entry_type)(struct device *dev, void *drv_data, int fru_id, u32 val);
> + int (*get_log_entry_type_per_dram)(struct device *dev, void *drv_data,
> + int fru_id, u32 *val);
> + int (*get_log_entry_type_per_memory_media)(struct device *dev, void *drv_data,
> + int fru_id, u32 *val);
> + int (*get_mode)(struct device *dev, void *drv_data, int fru_id, u32 *val);
> + int (*set_mode)(struct device *dev, void *drv_data, int fru_id, u32 val);
> + int (*get_mode_counts_rows)(struct device *dev, void *drv_data, int fru_id, u32 *val);
> + int (*get_mode_counts_codewords)(struct device *dev, void *drv_data, int fru_id, u32 *val);
> + int (*reset)(struct device *dev, void *drv_data, int fru_id, u32 val);
> + int (*get_threshold)(struct device *dev, void *drv_data, int fru_id, u32 *threshold);
> + int (*set_threshold)(struct device *dev, void *drv_data, int fru_id, u32 threshold);
> +};