Re: [PATCH v5 0/2] Incorporate DRAM address in EDAC messages

From: Borislav Petkov

Date: Wed Jun 10 2026 - 21:24:10 EST

+ Mauro.

On Wed, Jun 10, 2026 at 05:15:41PM -0400, Yazen Ghannam wrote:
> On Tue, Jun 09, 2026 at 05:35:16PM -0700, Borislav Petkov wrote:
> > On Thu, Jun 04, 2026 at 11:48:56AM -0400, Yazen Ghannam wrote:
> > > I think the issue is having concurrent updates to mci->error_desc. EDAC
> > > assumes a single owner for the MCIs. The fake_inject interface basically
> > > acts as a second module/owner.
> >
> > Ok, something like this. Ugly but it should show the intent. It probably needs
> > more auditing whether we have copied all the fields we need down the
> > edac_mc_handle_error() path...
> >
> > diff --git a/drivers/edac/debugfs.c b/drivers/edac/debugfs.c
> > index 8195fc9c9354..034b9b3f9cf5 100644
> > --- a/drivers/edac/debugfs.c
> > +++ b/drivers/edac/debugfs.c
> > @@ -14,28 +14,38 @@ static ssize_t edac_fake_inject_write(struct file *file,
> > struct mem_ctl_info *mci = to_mci(dev);
> > static enum hw_event_mc_err_type type;
> > u16 errcount = mci->fake_inject_count;
> > + struct mem_ctl_info *imci;
> > +
> > + imci = kmalloc(sizeof(struct mem_ctl_info), GFP_KERNEL);
> > + if (!imci)
> > + return -ENOMEM;
> > +
> > + /* use a local copy and copy the struct */
> > + *imci = *mci;
> >
> > if (!errcount)
> > errcount = 1;
> >
> > - type = mci->fake_inject_ue ? HW_EVENT_ERR_UNCORRECTED
> > - : HW_EVENT_ERR_CORRECTED;
> > + type = imci->fake_inject_ue ? HW_EVENT_ERR_UNCORRECTED
> > + : HW_EVENT_ERR_CORRECTED;
> >
> > printk(KERN_DEBUG
> > "Generating %d %s fake error%s to %d.%d.%d to test core handling. NOTE: this won't test the driver-specific decoding logic.\n",
> > errcount,
> > (type == HW_EVENT_ERR_UNCORRECTED) ? "UE" : "CE",
> > str_plural(errcount),
> > - mci->fake_inject_layer[0],
> > - mci->fake_inject_layer[1],
> > - mci->fake_inject_layer[2]
> > + imci->fake_inject_layer[0],
> > + imci->fake_inject_layer[1],
> > + imci->fake_inject_layer[2]
> > );
> > - edac_mc_handle_error(type, mci, errcount, 0, 0, 0,
> > - mci->fake_inject_layer[0],
> > - mci->fake_inject_layer[1],
> > - mci->fake_inject_layer[2],
> > + edac_mc_handle_error(type, imci, errcount, 0, 0, 0,
> > + imci->fake_inject_layer[0],
> > + imci->fake_inject_layer[1],
> > + imci->fake_inject_layer[2],
> > "FAKE ERROR", "for EDAC testing only");
> >
> > + kfree(imci);
> > +
> > return count;
> > }
> >
> >
> >
>
> Right, this fixes the race, and printk/trace would be okay. But the
> normal EDAC interface won't get updated. I think that breaks the fake
> inject use case.

You have the fake_inject fields, see edac_create_debugfs_nodes(). Those are
the fake_inject counters which should get incremented and I think we can make
it work by making the imci thing global and always hand it in to those
functions...

> Ex.
> 1) Use fake inject to target layer/severity.
> 2) Check EDAC interface to see layer/severity counters increase.
>
> I don't use the fake inject interface, so I may be mistaken about its
> use.

... and neither do I. And I haven't heard of anyone doing so. And that's
perhaps the case because the thing is behind CONFIG_EDAC_DEBUG and only we
enable that.

So I'm thinking fake_inject has reached its retiring age. And

452a6bf955ee ("edac: Add debufs nodes to allow doing fake error inject")

should simply be reverted.

Mauro's on Cc so that he can get a chance to chime in but lemme paste for him
the issue that Sashiko caught and that, AFAICT, has always been there:

---
> @@ -2748,7 +2768,7 @@ static void __log_ecc_error(struct mem_ctl_info *mci, struct err_info *err,
> edac_mc_handle_error(err_type, mci, 1,
> err->page, err->offset, err->syndrome,
> err->csrow, err->channel, -1,
> - string, "");
> + string, s);
Could passing a stack-allocated buffer to edac_mc_handle_error() risk a
use-after-free of stack memory?
Looking at edac_mc_handle_error(), it directly assigns the other_detail
pointer to the globally shared error_desc structure without holding locks:
drivers/edac/edac_mc.c:edac_mc_handle_error() {
...
e->other_detail = other_detail ?: "";
...
}
While normal hardware MCE decoding is serialized by the mce_work workqueue,
the EDAC fake_inject debugfs interface can execute concurrently.
Could a user writing to fake_inject race to modify mci->error_desc?
Thread 1 (mce_work):
__log_ecc_error()
char s[MSG_SIZE];
edac_mc_handle_error(..., s)
e->other_detail = s;
If __log_ecc_error() returns and its stack frame is popped, and Thread 2
(fake_inject) is concurrently executing edac_raw_mc_handle_error(), could
it read the dangling pointer from e->other_detail?
---

(from https://sashiko.dev/#/patchset/20260518160716.171578-1-yazen.ghannam%40amd.com)

If this were to be done right, then that passing in of a string buffer
allocated in the mci would mean that all those functions working on it down
the callpath cannot ever be called concurrently.

Or if they have to, a separate mci would have to be handed in. Either way,
that would need hairy rework and right now, the simplest thing to do is simply
zap that fake_inject stuff which is not used anyway.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette