RE: [PATCH v18 04/19] EDAC: Add memory repair control feature

From: Shiju Jose
Date: Thu Jan 09 2025 - 06:00:58 EST


>-----Original Message-----
>From: Borislav Petkov <bp@xxxxxxxxx>
>Sent: 09 January 2025 09:19
>To: Shiju Jose <shiju.jose@xxxxxxxxxx>
>Cc: linux-edac@xxxxxxxxxxxxxxx; linux-cxl@xxxxxxxxxxxxxxx; linux-
>acpi@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
>tony.luck@xxxxxxxxx; rafael@xxxxxxxxxx; lenb@xxxxxxxxxx;
>mchehab@xxxxxxxxxx; dan.j.williams@xxxxxxxxx; dave@xxxxxxxxxxxx; Jonathan
>Cameron <jonathan.cameron@xxxxxxxxxx>; dave.jiang@xxxxxxxxx;
>alison.schofield@xxxxxxxxx; vishal.l.verma@xxxxxxxxx; ira.weiny@xxxxxxxxx;
>david@xxxxxxxxxx; Vilas.Sridharan@xxxxxxx; leo.duran@xxxxxxx;
>Yazen.Ghannam@xxxxxxx; rientjes@xxxxxxxxxx; jiaqiyan@xxxxxxxxxx;
>Jon.Grimm@xxxxxxx; dave.hansen@xxxxxxxxxxxxxxx;
>naoya.horiguchi@xxxxxxx; james.morse@xxxxxxx; jthoughton@xxxxxxxxxx;
>somasundaram.a@xxxxxxx; erdemaktas@xxxxxxxxxx; pgonda@xxxxxxxxxx;
>duenwen@xxxxxxxxxx; gthelen@xxxxxxxxxx;
>wschwartz@xxxxxxxxxxxxxxxxxxx; dferguson@xxxxxxxxxxxxxxxxxxx;
>wbs@xxxxxxxxxxxxxxxxxxxxxx; nifan.cxl@xxxxxxxxx; tanxiaofei
><tanxiaofei@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Roberto
>Sassu <roberto.sassu@xxxxxxxxxx>; kangkang.shen@xxxxxxxxxxxxx;
>wanghuiqiang <wanghuiqiang@xxxxxxxxxx>; Linuxarm
><linuxarm@xxxxxxxxxx>
>Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
>
>On Mon, Jan 06, 2025 at 12:10:00PM +0000, shiju.jose@xxxxxxxxxx wrote:
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/min_hpa
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/min_dpa
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/min_nibble_mask
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/min_bank_group
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/min_bank
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/min_rank
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/min_row
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/min_column
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/min_channel
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/min_sub_channel
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/max_hpa
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/max_dpa
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/max_nibble_mask
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/max_bank_group
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/max_bank
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/max_rank
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/max_row
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/max_column
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/max_channel
>> +What: /sys/bus/edac/devices/<dev-
>name>/mem_repairX/max_sub_channel
>
>So this is new. I don't remember seeing that when I looked at your patches the
>last time.
>
>Looks like you have all those attributes and now you've decided to add a min and
>max for each one, in addition. And UI-wise it is a madness as there are gazillion
>single-value files now.
>

Thanks for the feedbacks.

The min_ and max_ attributes of the control attributes are added for your
feedback on V15 to expose supported ranges of these control attributes to the user,
in the following links.
However these min_ and max_ attributes are 'RO' instead of 'RW' as specified in the doc,
which to be fixed in the doc.
https://lore.kernel.org/lkml/20241114133249.GEZzX8ATNyc_Xw1L52@fat_crate.local/
https://lore.kernel.org/lkml/fa5d6bdd08104cf1a09c4960a0f9bc46@xxxxxxxxxx/
https://lore.kernel.org/lkml/20241119123657.GCZzyGaZIExvUHPLKL@fat_crate.local/

>"Attributes should be ASCII text files, preferably with only one value per file. It is
>noted that it may not be efficient to contain only one value per file, so it is
>socially acceptable to express an array of values of the same type."
>
>So you don't need those - you can simply express each attribute as a range:
>
>echo "1:2" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
>
>or if you wanna scrub only one bank:

After internal discussion, we think this is the source of the confusion.
This is not scrub where a range would indeed make sense. It is repair.
We are not aware of a failure mechanism where a set of memory banks
would fail together but not the whole of the next level up in the memory topology.

In theory we might get a stupid device design where it reports coarse level
errors but can only repair at fine levels where a range might be appropriate.
We are not sure that makes sense in practice and with a range interface we will
get mess like running out of repair resources half way through a list with
no visibility of what is repaired.

However, given the repair flow is driven by userspace receiving error records
that will only possible values to repair, we think these bounds on what can be
repaired are a nice to have rather than necessary so we would propose we do not
add max_ and min_ for now and see how the use cases evolve.
>
>echo "1:1" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
>
>What is the use case of that thing?
>
>Someone might find it useful so let's add it preemptively?
>
>Pfff.
>
>--
>Regards/Gruss,
> Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette

Thanks,
Shiju