Re: [PATCH v18 04/19] EDAC: Add memory repair control feature

From: Dan Williams
Date: Fri Jan 10 2025 - 17:49:51 EST


Jonathan Cameron wrote:
> On Thu, 9 Jan 2025 15:51:39 -0800
> Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> > Jonathan Cameron wrote:
> > > Ok. Best path is drop the available range support then (so no min_ max_ or
> > > anything to replace them for now).
> >
> > I think less is more in this case.
>
> A few notes before I get to specific questions.
>
> Key in the discussion that follows is that the 'repair' is a separate
> from the 'decision to repair'. They mostly need different information
> all of which is in the error trace points. A lot of the questions
> are about the 'decision to repair' part not the repair itself.
>
[snipped the parts I agree with]
> I'd advise mostly ignoring PPR and looking at memory sparing in
> the CXL spec if you want to consider an example. For PPR DPA is used
> (there is an HPA option that might be available). DPA is still needed
> for on boot soft repair (or we could delay that until regions configured,
> but then we'd need to do DPA to HPA mapping as that will be different
> on a new config, and then write HPA for the kernel to map it back to DPA.

This is helpful because I was indeed getting lost in what kind of
"repair" was being discussed in the thread. Ok, lets focus on sparing
commands.

>
> >
> > The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too
> > wide for userspace to have a chance at writing a competent tool. At
> > least I am struggling with where to even begin with those ABIs if I was
> > asked to write a tool. Does a tool already exist for those?
>
> There is little choice on that - those are the controls for this type
> of repair. If we do something like a magic 'key' based on a concatenation
> of those values we need to define that description to replace a clean
> self describing interface. I'm not 100% against that but I think it would
> be a terrible interface design and I don't think anyone is really in favor of it.
>
> All a userspace tool does is read the error record fields of
> exactly those names. From that it will log data (already happening under
> those names in RAS daemon alongside HPA/ DPA). Then, in simplest case,
> a threshold is passed and we write those values to the repair interface.
>
> There is zero need in that simple case for these to be understood at all.

This is where you lose me. The error record is a point in time snapshot
of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security
model for memory operations is based on coordinating with the kernel's
understanding of how that SPA is *currently* being used.

The kernel can not just take userspace's word for it that potentially
data changing, or temporary loss-of-use operations are safe to execute
just because once upon a time userspace saw an error record that
implicated a given SPA in the past, especially over reboot.

The SPA:HPA:DPA:DIMM tuple is invalidated on reconfiguration and reboot
events. It follows that the kernel has a security/integrity interest in
declining to act on invalidated tuples. This is solvable within a single
boot as the kernel can cache the error records and userspace can ask to
"trigger sparing based on cached record X".

For the reboot case when the error record cache is flushed the kernel
needs a reliable way to refill that cache, not an ABI for userspace to
say "take my word for it, this *should* be safe".

[snipped the explanation of replaying the old trace record parameters
data back through sysfs, because that is precisely the hang up I have
with the proposal]

>
> >
> > Some questions that read on those ABIs are:
> >
> > 1/ What if the platform has translation between HPA (CXL decode) and SPA
> > (physical addresses reported in trace points that PIO and DMA see)?
>
> See later for discussion of other interfaces.. This is assuming the
> repair key is not HPA (it is on some systems / situations) - if it's
> the repair key then that is easily dealt with.
>
> HPA / SPA more or less irrelevant for repair itself, they are relevant
> for the decision to repair. In the 'on reboot' soft repair case they may
> not even exist at the time of repair as it would be expected to happen
> before we've brought up a region (to get the RAM into a good state at boot).
>
> For cases where the memory decoders are configured and so there is an HPA
> to DPA mapping:
> The trace reports provide both all these somewhat magic values and
> the HPA. Thus we can do the HPA aware stuff on that before then looking
> up the other bit of the appropriate error reports to get the bank row etc.
>
> >
> > 2/ What if memory is interleaved across repair domains?
>
> Also not relevant to a repair control and only a little relevant to the
> decision to repair. The domains would be handled separately but if
> we are have to offline a chunk of memory to do it (needed for repair
> on some devices) that may be a bigger chunk if fine grained interleave
> in use and that may affect the decision.

Again, the repair control assumes that the kernel can just trust
userspace to get it right. When the kernel knows the SPA implications it
can add safety like "you are going to issue sparing on deviceA that will
temporarily take deviceA offline. CXL subsystem tells me deviceA is
interleaved with deviceB in SPA so the whole SPA range needs to be
offline before this operation proceeds". That is not someting that
userspace can reliably coordinate.

> > 3/ What if the device does not use DDR terminology / topology terms for
> > repair?
>
> Then we provide the additional interfaces assuming the correspond to well
> known terms. If someone is using a magic key then we can get grumpy
> with them, but that can also be supported.
>
> Mostly I'd expect a new technology to overlap a lot of the existing
> interface and maybe add one or two more; which layer in the stack for
> HBM for instance.

The concern is the assertion that sysfs needs to care about all these
parameters vs an ABI that says "repair errorX". If persistence and
validity of error records is the concern lets build an ABI for that and
not depend upon trust in userspace to properly coordinate memory
integrity concerns.

>
> The main alternative is where the device takes an HPA / SPA / DPA. We have one
> driver that does that queued up behind this series that uses HPA. PPR uses
> DPA. In that case userspace first tries to see if it can repair by HPA then
> DPA and if not moves on to see if it it can use the fuller description.
> We will see devices supporting HPA / DPA (which to use depends on when you
> are doing the operation and what has been configured) but mostly I'd expect
> either HPA/DPA or fine grained on a given repair instance.
>
> HPA only works if the address decoders are always configured (so not on CXL)
> What is actually happening in that case is typically that a firmware is
> involved that can look up address decoders etc, and map the control HPA
> to Bank / row etc to issue the actual low level commands. This keeps
> the memory mapping completely secret rather than exposing it in error
> records.
>
> >
> > I expect the flow rasdaemon would want is that the current PFA (leaky
> > bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> > has performed exceeds some threshold and it wants to attempt to repair
> > memory.
>
> Sparing may happen prior to point where we'd have done a soft offline
> if non disruptive (whether it is can be read from another bit of the
> ABI). Memory repair might be much less disruptive than soft-offline!
> I rather hope memory manufacturers build that, but I'm aware of at least
> one case where they didn't and the memory must be offline.

That's a good point, spare before offline makes sense.

[..]
> However, there are other usecases where this isn't needed which is why
> that isn't a precursor for this series.
>
> Initial enablement targets two situations:
> 1) Repair can be done in non disruptive way - no need to soft offline at all.

Modulo needing to quiesce access over the sparing event?

> 2) Repair can be done at boot before memory is onlined or on admin
> action to take the whole region offline, then repair specific chunks of
> memory before bringing it back online.

Which is userspace racing the kernel to online memory?

> > So, yes, +1 to simpler for now where software effectively just needs to
> > deal with a handful of "region repair" buttons and the semantics of
> > those are coarse and sub-optimal. Wait for a future where a tool author
> > says, "we have had good success getting bulk offlined pages back into
> > service, but now we need this specific finer grained kernel interface to
> > avoid wasting spare banks prematurely".
>
> Depends on where you think that interface is. I can absolutely see that
> as a control to RAS Daemon. Option 2 above, region is offline, repair
> all dodgy looking fine grained buckets.
>
> Note though that a suboptimal repair may mean permanent use of very rare
> resources. So there needs to be a control a the finest granularity as well.
> Which order those get added to userspace tools doesn't matter to me.
>
> If you mean that interface in kernel it brings some non trivial requirements.
> The kernel would need all of:
> 1) Tracking interface for all error records so the reverse map from region
> to specific bank / row etc is available for a subset of entries. The
> kernel would need to know which of those are important (soft offline
> might help in that use case, otherwise that means decision algorithms
> are in kernel or we have fine grained queue for region repair in parallel
> with soft-offline).
> 2) A way to inject the reverse map information from a userspace store
> (to deal with reboot etc).

Not a way to inject the reverse map information, a way to inject the
error records and assert that memory topology changes have not
invalidated those records.

> That sounds a lot harder to deal with than relying on the usespace program
> that already does the tracking across boots.

I am stuck behind the barrier of userspace must not assume it knows
better than the kernel about the SPA impact of a DIMM sparing
event. The kernel needs evidence either live records from within the
same kernel boot or validated records from a previous boot.

...devices could also help us out here with a way to replay DIMM error
events. That would allow for refreshing error records even if the
memory topology change because the new record would generate a refreshed
SPA:HPA:DPA:DIMM tuple.

> > Anything more complex than a set of /sys/devices/system/memory/
> > devices has a /sys/bus/edac/devices/devX/repair button, feels like a
> > generation ahead of where the initial sophistication needs to lie.
> >
> > That said, I do not closely follow ras tooling to say whether someone
> > has already identified the critical need for a fine grained repair ABI?
>
> It's not that we necessarily want to repair at fine grain, it's that
> the control interface to hardware is fine grained and the reverse mapping
> often unknown except for specific error records.
>
> I'm fully on board with simple interfaces for common cases like repair
> the bad memory in this region. I'm just strongly against moving the
> complexity of doing that into the kernel.

Yes, we are just caught up on where that "...but no simpler" line is
drawn.