RE: [PATCH v3] EDAC, mellanox: Add ECC support for BlueField DDR4

From: Junhan Zhou
Date: Thu May 30 2019 - 14:52:16 EST


> -----Original Message-----
> From: James Morse <james.morse@xxxxxxx>
> Sent: Thursday, May 23, 2019 1:30 PM
> To: Junhan Zhou <Junhan@xxxxxxxxxxxx>
> Cc: Borislav Petkov <bp@xxxxxxxxx>; Mauro Carvalho Chehab
> <mchehab@xxxxxxxxxx>; Liming Sun <lsun@xxxxxxxxxxxx>; linux-
> edac@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH v3] EDAC, mellanox: Add ECC support for BlueField DDR4
>
> Hi Junhan,
>
> On 21/03/2019 14:31, Junhan Zhou wrote:
> > Add ECC support for Mellanox BlueField SoC DDR controller.
> > This requires SMC to the running Arm Trusted Firmware to report what
> > is the current memory configuration.
>
> Sorry for the delay on this, it slipped through the cracks. (Please don't reply
> with new patches to the discussion of an old patch/series, this makes it look
> like ongoing discussion on a v1, and v2 never arrives!)
>
Sorry about this. But at least you caught me before leaving. Shravan (cc'ed) would be pushing the V4 and beyond versions of this patch.

> > +config EDAC_BLUEFIELD
> > + tristate "Mellanox BlueField Memory ECC"
> > + depends on ARM64 && ((MELLANOX_PLATFORM && ACPI) ||
> COMPILE_TEST)
>
> What is the MELLANOX_PLATFORM needed for? Is it just to turn off a set of
> drivers in one go? I can't see what other infrastructure you depend on.
>
Correct. Our memory controller is only present on the BlueField platform, so it doesn't make sense to be built on other platforms. There were discussions with other maintainers about what config option is used when upstreaming other drivers for BlueField, e.g. having a BLUEFIELD_SOC config. But in the end they settled down on simply using this flag.

> > +union mlxbf_emi_dram_additional_info_0 {
> > + struct {
> > + u32 err_bank : 4;
> > + u32 err_lrank : 2;
> > + u32 __reserved_0 : 2;
> > + u32 err_prank : 2;
> > + u32 __reserved_1 : 6;
> > + u32 err_edge : 8;
> > + u32 __reserved_2 : 8;
> > + };
> > +
> > + u32 word;
> > +};
>
> ... you're expecting the compiler to pack this bitfield in exactly the same way
> your hardware did. I don't think that's guaranteed.
> It evidently works for your current compiler, but another compiler may pack
> this structure differently. Toggling endianness will break this, (arm64
> supports both). If your platform supports aarch32, someone may want to get
> 32bit arm running, which may have different compiler behaviour.
>
> You are also using bitfields between hardware and firmware, so its currently
> possible the firmware requires the kernel to be built with a compiler that
> means it can't interact with the hardware...
>
> When this has come up in the past, the advice was not to use bitfields:
> https://lore.kernel.org/lkml/1077080607.1078.109.camel@gaston/
>
> Please use shifts and masks.
>
Okay if you insist. I do see this kind of style used elsewhere, e.g. in pnd2_edac.h. And this driver is only used on the BlueField SoC which our bootloader would configure it to be only running at Aarch64 little endian.

>
> > +#define MLXBF_EDAC_DIMM_PER_MC 2
> > +#define MLXBF_EDAC_ERROR_GRAIN 8
>
> If these numbers changed, would it still be a BlueField SoC?
> (if next years made-up:BlueField2 supports more Dimms per MC, it would be
> better to make this a property in the firmware table).
>
No it won't happen. The memory controller we are using doesn't have the strength to drive more than 2 DIMMs. If this were ever fixed it in the future, I would anticipate they design a completely new memory controller for which this driver code won't be compatible with.
BTW, how did you ever guess that the next gen chip would be called BlueField2? Did you search for "BlueField" in the registered PCI IDs and accidentally found it? (https://fossies.org/linux/systemd/hwdb/pci.ids)
>
> > +/*
> > + * Request MLNX_SIP_GET_DIMM_INFO
> > + *
> > + * Retrieve information about DIMM on a certain slot.
> > + *
> > + * Call register usage:
> > + * a0: MLNX_SIP_GET_DIMM_INFO
> > + * a1: (Memory controller index) << 16 | (Dimm index in memory
> > +controller)
> > + * a2-7: not used.
> > + *
> > + * Return status:
> > + * a0: dimm_info_smc union defined below describing the DIMM.
> > + * a1-3: not used.
> > + */
>
> Have Mellanox published these call numbers/arguments in a document
> somewhere? If they differ with the firmware, it would be good to know
> which side needs fixing.
>
> It is a little odd that you read the number of memory controllers from the
> ACPI table, but use an SMC call to read the DIMM information.
> Is it too-late to describe the DIMMs in the ACPI table too? (this would let
> firmware hard-code it on platforms where it could never change, instead of
> having to support a runtime call)
>
> The DIMM information should also be in the SMBIOS table. See ghes_edac.c
> for some code that uses this. SMBIOS isn't popular in the arm world: but
> edac already uses this, and we want to match DIMM numbers with the text
> on the board's silk-screen so the user can replace the correct DIMM.
>

No we haven't. But, we publish the source code to our firmware so anybody can edit and boot their own version.
We are using ATF (Arm Trusted Firmware) as the primary firmware (BL1, BL2, BL31) which does the memory training, and EDK2 for the UEFI implementation. If you extract our release tarball (scroll down at http://www.mellanox.com/page/products_dyn?product_family=279&mtag=bluefield_software and download BlueField-2.0.1.10841.tar.xz), you would find the files atf-d48f19.patch and edk2-465be7.patch, which once executed would download the upstream code and then apply our changes on top of it. (Yes, we are also in the process of pushing those code to their respective upstream repositories).
So the issue here is that ATF does the memory training, but UEFI is in charge of publishing the ACPI tables (which is where the SMBIOS table would be located correct?). And the BlueField SoC are used both on boards with soldered DRAMs chips (which doesnât have SPDs) and sockets to insert DIMMs on. So it would be a real mess for UEFI to figure out what is the DIMM configuration, and the most obvious way it to communicate through SMCs. So why go through this extra complicity when Linux can directly do the SMC? Or are you concerned about making this a standard so Windows can also figure out the DIMM configuration?
The number of memory controller as well as which physical address correspond to which memory controller is a fixed fact for the BlueField SoC, so this can be added to the ACPI table very easily without needing to dynamically patching it. Or do you rather this also be an SMC? This number can vary between the current BlueField and "BlueField2".


> > +/* Format for the SMC response about the memory information */ union
> > +dimm_info_smc {
> > + struct {
> > + unsigned long size_GB : 16;
> > + unsigned long is_rdimm : 1;
> > + unsigned long is_lrdimm : 1;
> > + unsigned long is_nvdimm : 1;
> > + unsigned long __reserved0 : 2;
> > + unsigned long ranks : 3;
> > + unsigned long package_X : 8; /* Bits per memory package
> */
> > + unsigned long __reserved1 : 32;
> > + };
> > + unsigned long val;
> > +};
>
> If your firmware and the kernel were built with different compilers, this isn't
> guaranteed to work. Please use shifts and masks.
>

Okay if you insist. I would think if one would tinker with the kernel driver code for BlueField, one might as well also build the firmware code which source we already provided.

> > + writel(edels.word, priv->emi_base +
> > +MLXBF_EMI_DRAM_ECC_LATCH_SELECT);
> > +
> > + /*
> > + * Verify that the ECC reported info in the registers is of the
> > + * same type as the one asked to report. If not, just report the
> > + * error without the detailed information.
> > + */
> > + eds.word = readl(priv->emi_base + MLXBF_EMI_DRAM_SYNDROM);
>
> Does the device need to have seen the write to
> MLXBF_EMI_DRAM_ECC_LATCH_SELECT before it sees this read?
>
> Will Deacon gave a presentation on this stuff at ELCE:
> https://elinux.org/images/a/a8/Uh-oh-Its-IO-Ordering-Will-Deacon-Arm.pdf
>
> (I don't understand this stuff, so may have it totally wrong here:)
>
> From the arch code's definitions of these:
> | #define writel(v,c) ({ __iowmb(); writel_relaxed((v),(c)); })
> | #define readl(c) ({ u32 __v = readl_relaxed(c); __iormb(__v); __v; })
>
> This means you've got back-to-back writel_relaxed()/readl_relaxed() here,
> and probably need an mb() between them.
>
> (slides 17 and 19 of that pdf are handy).
>
> As an example,
> drivers/edac/cell.c::cell_edac_check() has this out_be64(); mb(); in_be64()
> sequence, which I think is the same.
>
>
This is of no concern for the BlueField SoC. For accessing device memory, the hardware enforces a strict memory ordering. The next register access won't be issued before the previous instruction have been retired. So there won't be a case on the BlueField SoC (which is the only platform this driver is valid for) where the latter read happens before the previous write.

> > +
> > + edac_mc_handle_error(ecc_type, mci, error_cnt,
>
> > + ecc_dimm_addr / 0x1000,
>
> Please use PFN_DOWN() to take account of the (configurable!) kernel
> PAGE_SIZE.
>
>
Ok

> > ecc_dimm_addr & 0xfff,
> Please use something like PAGE_MASK, to take account of the PAGE_SIZE.
>
>
Ok

> > + eds.syndrom, ecc_dimm, 0, 0, mci->ctl_name, "");
> }
> > + dimm->nr_pages = smc_info.size_GB * 256 * 1024;
>
> How come PAGE_SIZE doesn't appear here?
> You may want to use SZ_1G and friends as part of the calculation to make it
> more readable.
>
>
Ok
>
>
> Thanks,
>
> James