RE: [PATCH v5 5/5] EDAC: Versal NET: Add support for error notification
From: Datta, Shubhrajyoti
Date: Wed Mar 05 2025 - 00:18:38 EST
[AMD Official Use Only - AMD Internal Distribution Only]
> -----Original Message-----
> From: Borislav Petkov <bp@xxxxxxxxx>
> Sent: Monday, March 3, 2025 11:25 PM
> To: Datta, Shubhrajyoti <shubhrajyoti.datta@xxxxxxx>
> Cc: Krzysztof Kozlowski <krzk@xxxxxxxxxx>; Rob Herring <robh@xxxxxxxxxx>;
> Conor Dooley <conor+dt@xxxxxxxxxx>; Tony Luck <tony.luck@xxxxxxxxx>; James
> Morse <james.morse@xxxxxxx>; Mauro Carvalho Chehab
> <mchehab@xxxxxxxxxx>; Robert Richter <rric@xxxxxxxxxx>; linux-
> kernel@xxxxxxxxxxxxxxx; devicetree@xxxxxxxxxxxxxxx; linux-edac@xxxxxxxxxxxxxxx; git
> (AMD-Xilinx) <git@xxxxxxx>
> Subject: Re: [PATCH v5 5/5] EDAC: Versal NET: Add support for error notification
>
> Caution: This message originated from an External Source. Use proper caution
> when opening attachments, clicking links, or responding.
>
>
> On Thu, Feb 27, 2025 at 11:32:10AM +0000, Datta, Shubhrajyoti wrote:
> > > > +union edac_info {
> > >
> > > What is an "edac_info"?
> > This is the row and column positions.
> > We are using it to extract the position from the address decoder registers.
>
> Needs a better name which is descriptive as to how it is used or what it represents.
Agreed will try to name row_col_mapping or addr_decoder_info.
>
> > > > +static void get_ddr_ue_error_info(u32
> > > > +error_data[REGS_PER_CONTROLLER],
> > > struct edac_priv *priv)
> > >
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > What is that for?
> > >
> > The error_data contains the register values. Linux does not have
> > access to the DDRMC register Space. It queries it from the NMC and gets the the
> data from the rpmsg callback.
>
> I wasn't clear. Arrays in C are passed as pointers - the compiler does that anyway.
> You don't have to do this weird parameter specification.
Will do thanks.
>
> > > > + mci->edac_cap = EDAC_FLAG_SECDED;
> > > > + mci->ctl_name = "amd_ddr_controller";
> > > > + mci->dev_name = dev_name(&pdev->dev);
> > > > + mci->mod_name = "amd_edac";
> > >
> > > Do:
> > >
> > > git grep mod_name drivers/edac/
> > >
> > > to get an idea how those names are chosen.
> > #define EDAC_MOD_STR "r82600_edac"
> > mci->mod_name = EDAC_MOD_STR;
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > /tree/drivers/edac/r82600_edac.c?h=v6.14-rc4#n316
> >
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > /tree/drivers/edac/i5000_edac.c?h=v6.14-rc4#n1424
> > mci->mod_name = "i5000_edac.c";
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> > /tree/drivers/edac/highbank_mc_edac.c?h=v6.14-rc4#n218
> > mci->mod_name = pdev->dev.driver->name;
> >
> > let me know if mci->mod_name = pdev->dev.driver->name; is fine.
>
> I think you didn't get me again.
>
> amd64_edac.c - the x86 driver is called this:
>
> #define EDAC_MOD_STR "amd64_edac"
>
> Calling yours "amd_edac" doesn't work.
>
> "versalnet_edac"? That's probably better.
Will do.
>
> > > You don't need "inline" - the compiler can decide that itself. And
> > > "process_bit" needs a better name.
> >
> > Will rename it to populate_row_bit
>
> Or "assign_row_bit" or whatever.
>
> The function name should be describing what the function does as closely as
> possible.
>
> > > Why are those functions copying stuff around? Why aren't you using
> > > cols directly?
> >
> > The column bit position is used in converting to the physical address.
> > We read it at init and use it every time an error occurs to get the address.
> > Did you mean to remove the regval. Or read the error_data every time.
>
> I mean simply use cols instead of assigning stuff to priv->col_bit* and then using
> that.
Will do
>
> > > Why is probing a work item?
> > >
> > > Explaining *that* is what a commit message is for - not for
> > > repeating useless info.
> > The RPMsg probe is invoked from a thread within the virtio driver
> > responsible for processing the response ring. If the probe initiates
> > an mcdi API call, it blocks until the mcdi response is received.
> > However, since the mcdi response is also processed by the same thread
> > that triggered the rpmsg probe, the thread remains blocked, preventing it from
> handling the response. This results in a deadlock.
> >
> > To prevent it we have a work scheduled.
>
> This is just insane.
>
> I don't see anything in amd_setup_mcdi() that needs some response from some
> mcdi thing. If not, you don't need the work queue thing.
The amd_setup_mcdi calls get_ddr_config this sends and rpc (calling cdx_mcdi_rpc) to get the ddr configuration.
>
> > >
> > > > + for (i = 0; i < NUM_CONTROLLERS; i++) {
> > > > + config = priv->adec[CONF + i];
> > > > + num_chans = FIELD_GET(DDRMC5_NUM_CHANS_MASK,
> config);
> > > > + rank = FIELD_GET(DDRMC5_RANK_MASK, config);
> > > > + rank = 1 << rank;
> > > > + dwidth = FIELD_GET(DDRMC5_BUS_WIDTH_MASK, config);
> > > > + dt = get_dwidth(dwidth);
> > > > + if (dt != DEV_UNKNOWN)
> > > > + break;
> > > > + }
> > >
> > > What is that loop supposed to do? Find the last controller before
> > > the one with DEV_UNKNOWN device width and register that one?
> >
> > There are 8 controllers all we try to get the first one that is enabled and register
> that one.
> > We use the device unknown to know if that is enabled or not.
>
> The first one that is enabled has unknown device width? What?
The loop iterates through all 8 controllers, checking their configuration registers.
It looks for the first enabled controller by verifying if its device width (dt) is not DEV_UNKNOWN.
Once it finds the first valid controller, it breaks out of the loop and registers that one
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette