Re: [PATCH v5 0/2] Incorporate DRAM address in EDAC messages

From: Yazen Ghannam

Date: Thu Jun 11 2026 - 10:11:57 EST

On Wed, Jun 10, 2026 at 06:23:36PM -0700, Borislav Petkov wrote:
> + Mauro.
>
> On Wed, Jun 10, 2026 at 05:15:41PM -0400, Yazen Ghannam wrote:
> > On Tue, Jun 09, 2026 at 05:35:16PM -0700, Borislav Petkov wrote:
> > > On Thu, Jun 04, 2026 at 11:48:56AM -0400, Yazen Ghannam wrote:
> > > > I think the issue is having concurrent updates to mci->error_desc. EDAC
> > > > assumes a single owner for the MCIs. The fake_inject interface basically
> > > > acts as a second module/owner.
> > >
> > > Ok, something like this. Ugly but it should show the intent. It probably needs
> > > more auditing whether we have copied all the fields we need down the
> > > edac_mc_handle_error() path...
> > >
> > > diff --git a/drivers/edac/debugfs.c b/drivers/edac/debugfs.c
> > > index 8195fc9c9354..034b9b3f9cf5 100644
> > > --- a/drivers/edac/debugfs.c
> > > +++ b/drivers/edac/debugfs.c
> > > @@ -14,28 +14,38 @@ static ssize_t edac_fake_inject_write(struct file *file,
> > > struct mem_ctl_info *mci = to_mci(dev);
> > > static enum hw_event_mc_err_type type;
> > > u16 errcount = mci->fake_inject_count;
> > > + struct mem_ctl_info *imci;
> > > +
> > > + imci = kmalloc(sizeof(struct mem_ctl_info), GFP_KERNEL);
> > > + if (!imci)
> > > + return -ENOMEM;
> > > +
> > > + /* use a local copy and copy the struct */
> > > + *imci = *mci;
> > >
> > > if (!errcount)
> > > errcount = 1;
> > >
> > > - type = mci->fake_inject_ue ? HW_EVENT_ERR_UNCORRECTED
> > > - : HW_EVENT_ERR_CORRECTED;
> > > + type = imci->fake_inject_ue ? HW_EVENT_ERR_UNCORRECTED
> > > + : HW_EVENT_ERR_CORRECTED;
> > >
> > > printk(KERN_DEBUG
> > > "Generating %d %s fake error%s to %d.%d.%d to test core handling. NOTE: this won't test the driver-specific decoding logic.\n",
> > > errcount,
> > > (type == HW_EVENT_ERR_UNCORRECTED) ? "UE" : "CE",
> > > str_plural(errcount),
> > > - mci->fake_inject_layer[0],
> > > - mci->fake_inject_layer[1],
> > > - mci->fake_inject_layer[2]
> > > + imci->fake_inject_layer[0],
> > > + imci->fake_inject_layer[1],
> > > + imci->fake_inject_layer[2]
> > > );
> > > - edac_mc_handle_error(type, mci, errcount, 0, 0, 0,
> > > - mci->fake_inject_layer[0],
> > > - mci->fake_inject_layer[1],
> > > - mci->fake_inject_layer[2],
> > > + edac_mc_handle_error(type, imci, errcount, 0, 0, 0,
> > > + imci->fake_inject_layer[0],
> > > + imci->fake_inject_layer[1],
> > > + imci->fake_inject_layer[2],
> > > "FAKE ERROR", "for EDAC testing only");
> > >
> > > + kfree(imci);
> > > +
> > > return count;
> > > }
> > >
> > >
> > >
> >
> > Right, this fixes the race, and printk/trace would be okay. But the
> > normal EDAC interface won't get updated. I think that breaks the fake
> > inject use case.
>
> You have the fake_inject fields, see edac_create_debugfs_nodes(). Those are
> the fake_inject counters which should get incremented and I think we can make
> it work by making the imci thing global and always hand it in to those
> functions...
>

The fake_inject fields are targets.

Ex. under '/sys/debug/kernel/edac/mc0'
fake_inject_count=3
fake_inject_channel=1
fake_inject_csrow=2

...means inject 3 correctable errors to mc0/channel1/csrow2.

This would be reflected in
'/sys/devices/system/edac/mc/mc0/rankX/dimm_ce_count'

rankX = whatever rank the channel/csrow map to.

OFC, the layers are generic. They're based on whatever the EDAC module
uses.

> > Ex.
> > 1) Use fake inject to target layer/severity.
> > 2) Check EDAC interface to see layer/severity counters increase.
> >
> > I don't use the fake inject interface, so I may be mistaken about its
> > use.
>
> ... and neither do I. And I haven't heard of anyone doing so. And that's
> perhaps the case because the thing is behind CONFIG_EDAC_DEBUG and only we
> enable that.
>
> So I'm thinking fake_inject has reached its retiring age. And
>
> 452a6bf955ee ("edac: Add debufs nodes to allow doing fake error inject")
>
> should simply be reverted.
>
> Mauro's on Cc so that he can get a chance to chime in but lemme paste for him
> the issue that Sashiko caught and that, AFAICT, has always been there:

Right, this isn't new behavior. That's why I don't think it's a real
issue. It's a corner case for a specific testing environment.

>
> ---
> > @@ -2748,7 +2768,7 @@ static void __log_ecc_error(struct mem_ctl_info *mci, struct err_info *err,
> > edac_mc_handle_error(err_type, mci, 1,
> > err->page, err->offset, err->syndrome,
> > err->csrow, err->channel, -1,
> > - string, "");
> > + string, s);
> Could passing a stack-allocated buffer to edac_mc_handle_error() risk a
> use-after-free of stack memory?
> Looking at edac_mc_handle_error(), it directly assigns the other_detail
> pointer to the globally shared error_desc structure without holding locks:
> drivers/edac/edac_mc.c:edac_mc_handle_error() {
> ...
> e->other_detail = other_detail ?: "";
> ...
> }
> While normal hardware MCE decoding is serialized by the mce_work workqueue,
> the EDAC fake_inject debugfs interface can execute concurrently.
> Could a user writing to fake_inject race to modify mci->error_desc?
> Thread 1 (mce_work):
> __log_ecc_error()
> char s[MSG_SIZE];
> edac_mc_handle_error(..., s)
> e->other_detail = s;
> If __log_ecc_error() returns and its stack frame is popped, and Thread 2
> (fake_inject) is concurrently executing edac_raw_mc_handle_error(), could
> it read the dangling pointer from e->other_detail?
> ---
>
> (from https://sashiko.dev/#/patchset/20260518160716.171578-1-yazen.ghannam%40amd.com)
>
> If this were to be done right, then that passing in of a string buffer
> allocated in the mci would mean that all those functions working on it down
> the callpath cannot ever be called concurrently.
>
> Or if they have to, a separate mci would have to be handed in. Either way,
> that would need hairy rework and right now, the simplest thing to do is simply
> zap that fake_inject stuff which is not used anyway.
>

Just curious, why not leave it? Do you think the bots will keep bringing
it up?

Thanks,
Yazen