RE: [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc.

From: Ghannam, Yazen
Date: Fri Nov 01 2019 - 11:19:42 EST


> -----Original Message-----
> From: Borislav Petkov <bp@xxxxxxxxx>
> Sent: Friday, October 25, 2019 9:35 AM
> To: Ghannam, Yazen <Yazen.Ghannam@xxxxxxx>
> Cc: linux-edac@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH v2 0/6] AMD64 EDAC: Check for nodes without memory, etc.
>
> On Tue, Oct 22, 2019 at 08:35:08PM +0000, Ghannam, Yazen wrote:
> > From: Yazen Ghannam <yazen.ghannam@xxxxxxx>
> >
> > Hi Boris,
> >
> > Most of these patches address the issue where the module checks and
> > complains about DRAM ECC on nodes without memory.
> >
> > Thanks,
> > Yazen
> >
> > Link:
> > https://lkml.kernel.org/r/20191018153114.39378-1-Yazen.Ghannam@xxxxxxx
> >
> > Yazen Ghannam (6):
> > EDAC/amd64: Make struct amd64_family_type global
> > EDAC/amd64: Gather hardware information early
> > EDAC/amd64: Save max number of controllers to family type
> > EDAC/amd64: Use cached data when checking for ECC
> > EDAC/amd64: Check for memory before fully initializing an instance
> > EDAC/amd64: Set grain per DIMM
> >
> > drivers/edac/amd64_edac.c | 196 +++++++++++++++++++-------------------
> > drivers/edac/amd64_edac.h | 2 +
> > 2 files changed, 100 insertions(+), 98 deletions(-)
>
> Almost there: now it dumps the whole shebang twice. This is on an old
> F10h box which doesn't have ECC DIMMs:
>
> [ 2.222853] EDAC MC: Ver: 3.0.0
> [ 2.226881] EDAC DEBUG: edac_mc_sysfs_init: device mc created
> [ 5.726912] EDAC amd64: F10h detected (node 0).
...
> [ 6.208087] EDAC amd64: F10h detected (node 0).

Is the module being probed twice? We have this problem in general: the
module can get loaded multiple times when initialization fails.

The clue for me is that node 0 gets detected twice. The detection happens in
per_family_init(), which runs early in probe_one_instance().

In any case, I think we can make !ecc_enabled(pvt) in probe_one_instance() a
failure now that we have an explicit check for memory on a node. In other
words, if we have memory and ECC is disabled then this is a failure for the
module.
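
To make the proposed ordering concrete, here is a rough user-space sketch of
the decision, not the actual driver code. The struct, the enum, and the
function name are hypothetical stand-ins; only the order of the two checks
(memory first, then ECC) reflects what I am suggesting above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-node state the driver caches early. */
struct node_state {
	bool has_memory;   /* assumed result of the explicit memory check */
	bool ecc_enabled;  /* assumed result of the ecc_enabled(pvt) check */
};

/* Hypothetical outcomes of probing one node. */
enum probe_result {
	PROBE_SKIP_NODE,   /* no memory: nothing to do, not an error */
	PROBE_FAIL,        /* memory present but ECC disabled */
	PROBE_OK,          /* memory present and ECC enabled */
};

/*
 * Sketch of the proposed check order:
 *  1. A node without memory is skipped quietly.
 *  2. A node with memory but without ECC is a failure for the module.
 *  3. Otherwise initialization continues.
 */
enum probe_result probe_one_instance_sketch(const struct node_state *node)
{
	if (!node->has_memory)
		return PROBE_SKIP_NODE;

	if (!node->ecc_enabled)
		return PROBE_FAIL;

	return PROBE_OK;
}
```

With this ordering, the ECC complaint can only come from a node that
actually has memory.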

Thanks,
Yazen