Re: [PATCH] EDAC/versal: Report PFN and page offset for DDR errors

From: Srivatsa S. Bhat

Date: Mon Apr 27 2026 - 14:50:59 EST

On Mon, Apr 27, 2026 at 06:27:36AM +0000, Datta, Shubhrajyoti wrote:
> Public
>
> > -----Original Message-----
> > From: Srivatsa S. Bhat <srivatsa@xxxxxxxxxxxxx>
> > Sent: Wednesday, April 22, 2026 9:20 AM
> > To: Datta, Shubhrajyoti <shubhrajyoti.datta@xxxxxxx>
> > Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-edac@xxxxxxxxxxxxxxx; git (AMD-Xilinx)
> > <git@xxxxxxx>; ptsm@xxxxxxxxxxxxxxxxxxx; shubhrajyoti.datta@xxxxxxxxx;
> > Borislav Petkov <bp@xxxxxxxxx>; Tony Luck <tony.luck@xxxxxxxxx>
> > Subject: Re: [PATCH] EDAC/versal: Report PFN and page offset for DDR errors
> >
> > Caution: This message originated from an External Source. Use proper caution
> > when opening attachments, clicking links, or responding.
> >
> >
> > On Wed, Apr 15, 2026 at 11:32:39AM +0530, Shubhrajyoti Datta wrote:
> > > Currently, DDRMC correctable and uncorrectable error events are
> > > reported to EDAC with page frame number (pfn) and offset set to zero.
> > > This information is not useful to locate the address for memory errors.
> > >
> > > Compute the physical address from the error information and extract
> > > the page frame number and offset before calling edac_mc_handle_error().
> > > This provides the actual memory location information to the userspace.
> > >
> > > Fixes: 6f15b178cd63 ("EDAC/versal: Add a Xilinx Versal memory
> > > controller driver")
> > > Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@xxxxxxx>
> > > ---
> > >
> > > drivers/edac/versal_edac.c | 36 +++++++++++++++++-------------------
> > > 1 file changed, 17 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/drivers/edac/versal_edac.c b/drivers/edac/versal_edac.c
> > > index 5a43b5d43ca2..18045f96610e 100644
> > > --- a/drivers/edac/versal_edac.c
> > > +++ b/drivers/edac/versal_edac.c
> > > @@ -414,34 +414,32 @@ static unsigned long convert_to_physical(struct
> > > edac_priv *priv, union ecc_error static void handle_error(struct
> > > mem_ctl_info *mci, struct ecc_status *stat) {
> >
> > [...]
> >
> > > if (stat->error_type == XDDR_ERR_TYPE_CE) {
> >
> > [...]
> >
> > > + } else if (stat->error_type == XDDR_ERR_TYPE_UE) {
> >
> > [...]
> > > + } else {
> > > + return;
> >
> > I like the cleanup contributed by this patch (in terms of reducing code
> > duplication) in addition to the actual fix. However, this patch also introduces a
> > subtle behavior change - the existing code calls
> > memset() to clear out the ecc_status struct unconditionally, but this patch
> > doesn't call memset if the error type is not CE or UE (i.e., in the early return
> > path).
> >
> > Was this change intentional? Wouldn't it potentially cause stale data to be left
> > over in the ecc_status struct, affecting future reuse?
>
> Hi Srivatsa,
> Thanks for the review.
>
> The early return in handle_error is intentionally safe.
> get_error_info is always called before handle_error:
>
> if (get_error_info(priv))
> return;
> handle_error(mci, &priv->stat);
>
> get_error_info already guards against non-CE/UE events:
>
> if (!eccr0_ceval && !eccr1_ceval && !eccr0_ueval && !eccr1_ueval)
> return 1;
>
> So handle_error is only ever reached when at least one CE or UE
> error is present, making the else { return; } branch unreachable
> in practice. The removed memset in that path was overprotective
> dead code.
>
> For future events, get_error_info always reloads the data fresh
> before handle_error runs, so there is no risk of stale data.
>

I see, thank you for the clarification!

> I can add a comment above handle_error noting this precondition if
> that would help future readers.
>

Sure, that would be great! Along with that, how about also removing
the else { return; } branch if it is actually unreachable?

Regards,
Srivatsa
Microsoft Linux Systems Group