RE: [PATCH v1 1/1] EDAC, mellanox: Add ECC support for BlueField DDR4

From: Junhan Zhou
Date: Mon Mar 18 2019 - 14:40:09 EST


> > +config EDAC_BLUEFIELD
> > + tristate "Mellanox BlueField Memory ECC"
> > + depends on (MELLANOX_PLATFORM && ARM64 && ACPI) ||
> COMPILE_TEST
>
> MELLANOX_PLATFORM depends on X86 || ARM but this depends on ARM64.
> How is that even supposed to work?

We are pushing to add driver support in the mainline kernel for our A72 (ARM64) based BlueField SoC, and after debates decided to use the MELLANOX_PLATFORM option for this.
My colleague is also in the process of upstreaming a driver which makes MELLANOX_PLATFORM depends on ARM64, you can see the patch at https://lore.kernel.org/lkml/1552056084-12137-1-git-send-email-lsun@xxxxxxxxxxxx/
If you want, I can also add the change to this patch, but I would prefer to only have the change in one patch.

>
> > + help
> > + Support for error detection and correction on the
> > + Mellanox BlueField SoCs.
> > +
> > endif # EDAC
>
> This patch needs MAINTAINERS entry so that you can get CCed on bug
> reports.

Done. Added myself.

> > +union MLXBF_EMI_DRAM_ECC_COUNT_u {
>
> No all caps names pls. And no "_u" suffixing - this is not C++.

Done. Removed the _u suffix and changed all union declarations to lower case.

> > + struct {
> > + /* ECC single errors counter */
> > + u32 single_error_count : 16;
>
> You have excessive comments for members whose names already describe
> what they are. Remove the comments and descrease the clutter this way.

Sorry, this was autogenerated from the register hardware definitions. Removed all except for the ECC syndrome one as it does have added information of saying it's error bit according to the Hamming code.

>
> > +union MLXBF_EMI_DRAM_ECC_ERROR_u {
> > + struct {
> > + /* 1 - ECC single error from DRAM wrapper x. */
>
> What's a "DRAM wrapper x" which is being mentioned here and there?

Sorry about this, removed. It would be more confusing if just left here.
The story is this hardware IP was originally used for the Ezchip NPS memory controller where there were a lot of separate DRAM groups or "wrappers", and each bit here would corresponding to one of it. But when moving to BlueField it was decided to get rid of all these wrappers. But apparently the description wasn't updated.

> > + u32 dram_ecc_single : 1;
> > + /* Reserved. */
> All those "Reserved" comments are completely useless.
All removed.
> > + * Retrieve information about DIMM on a certaion slot.
> ^^^^^^^^
> Run your whole patch through a spellchecker.
Good tip. Run it through aspell. Though there are a lot of false positives, especially for the git hashes. Any tips on this?
BTW, it did pick up "syndrom" as an error where the correct spelling is "syndrome". But the hardware register is named that way so I kept that alone.
> > + if (edec.single_error_count != 0) {
> if (edec.single_error_count) {
> is enough. Below too.
Done.
> > + if (smc_info.size_GB == 0) {
>
> if (!...
>
Done

> > + /* Only POLL mode supported so far. */
> > + edac_op_state = EDAC_OPSTATE_POLL;
>
> This needs to happen *last* in this function, when everything else succeeds.
>
Done
> > + if (mci == NULL)
>
> if (!mci)
>
Done

Thanks for your comments! Just posted a V2 patch addressing them.