Re: [PATCH v18 04/19] EDAC: Add memory repair control feature

From: Jonathan Cameron
Date: Thu Jan 09 2025 - 11:02:22 EST


On Thu, 9 Jan 2025 16:18:54 +0100
Borislav Petkov <bp@xxxxxxxxx> wrote:

> On Thu, Jan 09, 2025 at 02:24:33PM +0000, Jonathan Cameron wrote:
> > To my thinking that would fail the test of being an intuitive interface.
> > To issue a repair command requires that multiple attributes be configured
> > before triggering the actual repair.
> >
> > Think of it as setting the coordinates of the repair in a high dimensional
> > space.
>
> Why?

Ok. To me the fact it's not a single write was relevant. Seems not
in your mental model of how this works. For me a single write
that you cannot query back is fine, setting lots of parameters and
being unable to query any of them less so. I guess you disagree.
In interests of progress I'm not going to argue further. No one is
going to use this interface by hand anyway so the lost of useability
I'm seeing doesn't matter a lot.

>
> You can write every attribute in its separate file and have a "commit" or
> "start" file which does that.

That's what we have.

>
> Or you can designate a file which starts the process. This is how I'm
> injecting errors on x86:
>
> see readme_msg here: arch/x86/kernel/cpu/mce/inject.c
>
> More specifically:
>
> "flags:\t Injection type to be performed. Writing to this file will trigger a\n"
> "\t real machine check, an APIC interrupt or invoke the error decoder routines\n"
> "\t for AMD processors.\n"
>
> So you set everything else, and as the last step you set the injection type
> *and* you also trigger it with this one write.

Agreed. I'm not sure of the relevance though. This is how it works and
there is no proposal to change that.

What I was trying to argue was for an interface that let you set all the
coordinates and read back what they were before hitting go.

>
> > Sure. In this case the addition of min/max was perhaps a wrong response to
> > your request for a way to those ranges rather than just rejecting a write
> > of something out of range as earlier version did.
> >
> > We can revisit in future if range discovery becomes necessary. Personally
> > I don't think it is given we are only taking these actions in response error
> > records that give us precisely what to write and hence are always in range.
>
> My goal here was to make this user-friendly. Because you need some way of
> knowing what valid ranges are and in order to trigger the repair, if it needs
> to happen for a range.

In at least the CXL case I'm fairly sure most of them are not discoverable.
Until you see errors you have no idea what the memory topology is.

>
> Or, you can teach the repair logic to ignore invalid ranges and "clamp" things
> to whatever makes sense.

For that you'd need to have a path to read back what happened.

>
> Again, I'm looking at it from the usability perspective. I haven't actually
> needed this scrub+repair functionality yet to know whether the UI makes sense.
> So yeah, collecting some feedback from real-life use cases would probably give
> you a lot better understanding of how that UI should be designed... perhaps
> you won't ever need the ranges, whow knows.
>
> So yes, preemptively designing stuff like that "in the dark" is kinda hard.
> :-)

The discoverability is unnecessary for any known usecase.

Ok. Then can we just drop the range discoverability entirely or we go with
your suggestion and do not support read back of what has been
requested but instead have the reads return a range if known or "" /
return -EONOTSUPP if simply not known?


I can live with that though to me we are heading in the direction of
a less intuitive interface to save a small number of additional files.

Jonathan

>