Re: [PATCH v15 11/15] EDAC: Add memory repair control feature

From: Borislav Petkov
Date: Mon Nov 11 2024 - 06:30:14 EST


On Mon, Nov 04, 2024 at 01:05:31PM +0000, Shiju Jose wrote:
> More detailed explanation of PPR and memory sparing and use cases was added
> in Documentation/edac/memory_repair.rst, which is part of the last common
> patch ("EDAC: Add documentation for RAS feature control") added for
> documentation of various RAS features supported in this series. Was not sure
> the file to be part of this patch or not.

If the commit message doesn't contain a justification for a patch's existence,
why do you even bother sending it?

IOW, no redirections pls - just state here what the use case is in short. You
can always go nuts into details in the docs.

> persist_mode used to readback the value of persist_mode presently set. For
> eg. 1 - soft memory sparing for a sparing instance, though the CXL memory
> device supports both soft and hard sparing, which is configurable.
> persist_mode_avail used to return the temporary and permanent repair
> capability of the device.

Wait, sysfs does a one value per file thing. What does persist_mode_avail
give?

Surely you can't dump a list of all available modes...

>From that doc:

root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair0/persist_mode_avail
0

Does that mean only sPPR is available?

If only one mode is available, why am I even querying things? There's no other
option.

Catch my drift?

> Also I will update here with more details which was given in the last part
> of this document about DPA. Some memory devices (For eg. a CXL memory
> device) may expect the Device Physical Address(DPA) for a repair operation
> instead of Host Physical Address(HPA), because it may not have an active
> mapping in the main host address physical address map. 'dpa_support'
> attribute used to return this info to the user.

All this stuff needs to be documented properly and especially how one is
supposed to use this interface. Not have people go read CXL specs just to be
able to even try to use this. I'd like to see clear steps in the docs what to
do and what they mean.

> The nibble mask actually for CXL memory PPR and memory sparing operations,
> which is reported by the device in DRAM Event Record and to the userspace in the
> CXL DRAM trace event.
> Please see the details from the spec.

This is *exactly* what I mean!

If I have to see the spec in order to use an interface, than that's a major
fail.

> I was not sure add or not these CXL specific details in this EDAC document.

So that document should contain enough info on how to use the interface. You
can always put links to the spec giving people further reading but some
initial how-do-I-use-this-damn-thing example should be there so that people
can find their way around this.

> The visibility of these control attributes to the user in sysfs is decided
> by the is_visible() callback in the EDAC, which in turn depends on a memory
> device support or not the control of a repair attribute.

That still doesn't answer my question: what are valid values I can put in all
those?

Try as many as I can until one sticks?

This is not a good interface.

And since sysfs does one-value-per-file, dumping ranges here is kinda wrong.

> This attribute used request to determine availability of resources for a repair operation
> (For eg. memory PPR and sparing operation) for a given address and memory attributes set.
> The device may return result for this request in different ways.
> For example, in CXL device request query resource command for a,
> 1. PPR operation returns resource availability as a return code of the command.
> 2. memory sparing operation, the device will report the resource availability by producing a
> Memory Sparing Event Record and memory sparing trace event to the userspace.
>
> May be 'dry-run' better name instead of query?

Maybe this should not exist at all: my simple thinking would say that
determining whether resources are available should be part of the actual
repair operation. If none are there, it should return "no resources
available". If there are, it should simply use them and do the repair.

Exposing this as an explicit step sounds silly.

> >Yeh, this needs to be part of the interface and not hidden in some obscure doc.
> Adding this info in Documentation/edac/memory_repair.rst is sufficient?

Yap, for example. You can always concentrate the whole documentation there and
point to it from everywhere else.

> The details of the repairing control was added in
> Documentation/edac/memory_repair.rst, which is part of the common
> patch ("EDAC: Add documentation for RAS feature control").

Ok, point to it pls in this doc so that people can find it.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette