Re: [PATCH v13 11/18] cxl/memfeature: Add CXL memory device ECS control feature

From: Jonathan Cameron
Date: Mon Oct 14 2024 - 11:43:31 EST


On Wed, 9 Oct 2024 13:41:12 +0100
<shiju.jose@xxxxxxxxxx> wrote:

> From: Shiju Jose <shiju.jose@xxxxxxxxxx>
>
> CXL spec 3.1 section 8.2.9.9.11.2 describes the DDR5 ECS (Error Check
> Scrub) control feature.
> The Error Check Scrub (ECS) is a feature defined in JEDEC DDR5 SDRAM
> Specification (JESD79-5) and allows the DRAM to internally read, correct
> single-bit errors, and write back corrected data bits to the DRAM array
> while providing transparency to error counts.
I never understood the 'transparency to error counts'.
Maybe from software point of view
'while reporting error counts to the host'.
Unless anyone else can figure out what that text from the CXL spec
means? (I'm guessing it is cut and paste from the JEDEC spec)

>
> The ECS control allows the requester to change the log entry type, the ECS
> threshold count provided that the request is within the definition
> specified in DDR5 mode registers, change mode between codeword mode and
> row count mode, and reset the ECS counter.
>
> Register with EDAC device driver, which gets the ECS attr descriptors
> from the EDAC ECS and expose sysfs ECS control attributes to userspace.
> For example ECS control for the memory media FRU 0 in CXL mem0 device is
> in /sys/bus/edac/devices/cxl_mem0/ecs_fruX/
fru0?
>
> Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>

A few little things in line. In general looks good to me.

> ---
> drivers/cxl/core/memfeature.c | 467 +++++++++++++++++++++++++++++++++-
> 1 file changed, 461 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/cxl/core/memfeature.c b/drivers/cxl/core/memfeature.c
> index 84d6e887a4fa..567406566c77 100644
> --- a/drivers/cxl/core/memfeature.c
> +++ b/drivers/cxl/core/memfeature.c
> @@ -19,7 +19,7 @@
> #include <cxl.h>
> #include <linux/edac.h>
>
> -#define CXL_DEV_NUM_RAS_FEATURES 1
> +#define CXL_DEV_NUM_RAS_FEATURES 2
> #define CXL_DEV_HOUR_IN_SECS 3600
>
> #define CXL_SCRUB_NAME_LEN 128
> @@ -309,6 +309,420 @@ static const struct edac_scrub_ops cxl_ps_scrub_ops = {
> .set_cycle_duration = cxl_patrol_scrub_write_scrub_cycle,
> };
>
> +/* CXL DDR5 ECS control definitions */
> +static const uuid_t cxl_ecs_uuid =
> + UUID_INIT(0xe5b13f22, 0x2328, 0x4a14, 0xb8, 0xba, 0xb9, 0x69, 0x1e, \

Why the \?

> + 0x89, 0x33, 0x86);
> +


> +
> +#define CXL_ECS_LOG_ENTRY_TYPE_MASK GENMASK(1, 0)
> +#define CXL_ECS_REALTIME_REPORT_CAP_MASK BIT(0)
> +#define CXL_ECS_THRESHOLD_COUNT_MASK GENMASK(2, 0)
> +#define CXL_ECS_MODE_MASK BIT(3)

That name is a little generic. Maybe CXL_ECS_COUNT_MODE_MASK ?

> +#define CXL_ECS_RESET_COUNTER_MASK BIT(4)
> +
> +static const u16 ecs_supp_threshold[] = { 0, 0, 0, 256, 1024, 4096 };
> +
> +enum {
> + ECS_LOG_ENTRY_TYPE_DRAM = 0x0,
> + ECS_LOG_ENTRY_TYPE_MEM_MEDIA_FRU = 0x1,
> +};
> +
> +enum {
> + ECS_THRESHOLD_256 = 3,
> + ECS_THRESHOLD_1024 = 4,
> + ECS_THRESHOLD_4096 = 5,
> +};
Perhaps move this above the ecs_supp_threshold array and use
static const ecs_supp_threshold[] = {
[ECS_THRESHOLD_256] = 256,
[ECS_THRESHOLD_1024] = 1024,
[ECS_THRESHOLD_4096] = 4096,
};
which will fill the zeros in for you. You don't care about them anyway
as they are undefined values.

> +
> +enum cxl_ecs_mode {
> + ECS_MODE_COUNTS_ROWS = 0,
> + ECS_MODE_COUNTS_CODEWORDS = 1,
> +};

> +
> +static int cxl_mem_ecs_set_attrs(struct device *dev, void *drv_data, int fru_id,
> + struct cxl_ecs_params *params, u8 param_type)
> +{
> + struct cxl_ecs_context *cxl_ecs_ctx = drv_data;
> + struct cxl_memdev *cxlmd = cxl_ecs_ctx->cxlmd;
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> + struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlds);
> + struct cxl_ecs_fru_rd_attrs *fru_rd_attrs;
> + struct cxl_ecs_fru_wr_attrs *fru_wr_attrs;
> + size_t rd_data_size, wr_data_size;
> + u16 num_media_frus, count;
> + size_t data_size;
> + int ret;
> +
> + num_media_frus = cxl_ecs_ctx->num_media_frus;
> + rd_data_size = cxl_ecs_ctx->get_feat_size;
> + wr_data_size = cxl_ecs_ctx->set_feat_size;
> + struct cxl_ecs_rd_attrs *rd_attrs __free(kfree) =
> + kmalloc(rd_data_size, GFP_KERNEL);
> + if (!rd_attrs)
> + return -ENOMEM;
> +
> + data_size = cxl_get_feature(mds, cxl_ecs_uuid,
> + CXL_GET_FEAT_SEL_CURRENT_VALUE,
> + rd_attrs, rd_data_size);
> + if (!data_size)
> + return -EIO;

blank line here as the next line isn't part of this allocate / check
errors block.

> + struct cxl_ecs_wr_attrs *wr_attrs __free(kfree) =
> + kmalloc(wr_data_size, GFP_KERNEL);
> + if (!wr_attrs)
> + return -ENOMEM;
> +
> + /* Fill writable attributes from the current attributes read
CXL uses
/*
* Fill writable
style for multiline comments.

> + * for all the media FRUs.
> + */



> +static int cxl_ecs_get_mode_counts_rows(struct device *dev, void *drv_data,
> + int fru_id, u32 *val)
> +{
> + struct cxl_ecs_params params;
> + int ret;
> +
> + ret = cxl_mem_ecs_get_attrs(dev, drv_data, fru_id, &params);
> + if (ret)
> + return ret;
> +
> + if (params.mode == ECS_MODE_COUNTS_ROWS)
> + *val = 1;
> + else
> + *val = 0;
> +
> + return 0;
> +}
> +
> +static int cxl_ecs_get_mode_counts_codewords(struct device *dev, void *drv_data,
> + int fru_id, u32 *val)
> +{
> + struct cxl_ecs_params params;
> + int ret;

This form is pretty common. Maybe worth some macros like
you have in the edac side of things?

> +
> + ret = cxl_mem_ecs_get_attrs(dev, drv_data, fru_id, &params);
> + if (ret)
> + return ret;
> +
> + if (params.mode == ECS_MODE_COUNTS_CODEWORDS)
> + *val = 1;
> + else
> + *val = 0;
> +
> + return 0;
> +}