Re: [PATCH 14/21] EDAC, ghes: Extract numa node information for each dimm

From: Robert Richter
Date: Thu Jun 13 2019 - 16:58:53 EST


Hi James,

thank you for your review and response here. See my comments below.

On 29.05.19 18:51:00, James Morse wrote:
> On 29/05/2019 09:44, Robert Richter wrote:
> > In a later patch we want to have one mc device per node. This patch
> > extracts the numa node information for each dimm. This is done by
> > collecting the physical address ranges from the DMI table (Memory
> > Array Mapped Address - Type 19 of SMBIOS spec). The node information
> > for a physical address is already know to a numa aware system (e.g. by
> > using the ACPI _PXM method or the ACPI SRAT table), so based on the PA
> > we can assign the node id to the dimms.
>
> I think you're letting the smbios information drive you here. We'd like to do as much as
> possible without it, all its really good for is telling us the label on the PCB.
>
> With this approach, you only get numa information by parsing more smbios, which we have to
> try and validate, and fall back to some error path if it smells wrong. We end up needing
> things like a 'fallback memory controller' in the case the firmware fault-time value is
> missing, or nuts.
>
> What bugs me is we already know the numa information from the address. We could expose
> that without the smbios tables at all, and it would be useful to someone playing the
> dimm-bisect game. Not making it depend on smbios means there is a good chance it can
> become common with other edac drivers.

What a ghes driver will never have in common with other edac drivers
is knowledge of the memory hierarchy. Other drivers know the
underlying hardware and can determine the total number of dimms and
their location mapping indicated by a tuple (card/module, row/channel,
top_layer/mid_layer, etc.) using something like:

index = top_layer * mid_layer_size + mid_layer;

The ghes driver cannot calculate a dimm index in that way since the
size of each layer is unknown. This only leaves you using the
dmi_handle from the error record to do the dimm mapping. I don't see
any other way here.
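
To make the difference concrete, here is a minimal userspace sketch, not
code from the patch. It assumes a hypothetical struct dimm_entry that
only carries the SMBIOS handle. The first helper is the layered
calculation shown above, the second is the handle search that the ghes
driver is left with when the layer sizes are unknown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-dimm record; only the SMBIOS handle matters here. */
struct dimm_entry {
	uint16_t smbios_handle;
};

/* Layered drivers: the layer sizes are known, so the index is computed. */
static int layered_dimm_index(int top_layer, int mid_layer, int mid_layer_size)
{
	assert(mid_layer >= 0 && mid_layer < mid_layer_size);
	return top_layer * mid_layer_size + mid_layer;
}

/*
 * ghes-style lookup: the layer sizes are unknown, so search the dimm
 * array for the handle taken from the error record.
 * Returns the array index, or -1 if no dimm has that handle.
 */
static int handle_dimm_index(const struct dimm_entry *dimms, size_t n,
			     uint16_t handle)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (dimms[i].smbios_handle == handle)
			return (int)i;
	}
	return -1;
}
```

The handle values used to exercise this are made up. The point is only
that the second lookup depends on the firmware-provided handle rather
than on any knowledge of the hierarchy.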

>
> I don't think we need to know the dimm->node mapping up front. When we get an error,
> pfn_to_nid() on the address tells us which node that memory is attached to. This should be
> the only place nid information comes from, that way we don't need to check it. Trying to
> correlate it with smbios tables is much more code. If the CPER comes with a DIMM handle,
> we know the DIMM too.

The dimm/node mapping is not the issue here and we could also use the
phys addr to select the node's memory controller. But we still need to
be able to somehow select the dimm the error belongs to. The ghes
driver cannot use the location tuple here to get the dimm index in the
mc's array.

> So all we really need is to know at setup time is how many numa-nodes there are, and the
> maximum DIMM per node if we don't want phantom-dimms. Type-17 already has a
> Physical-Memory-Array-Handle, which points at Type-19... but we don't need to read it,
> just count them and find the biggest.
>
> If the type-19 information is missing from smbios, or not linked up properly, or the
> values provided at fault-time don't overlap with the values in the table, or there is no
> fault-time node information: you still get the numa-node information based on the address.
>
> Using the minimum information should give us the least code, and the least exposure to
> misdescribed tables.

As I said, we need the firmware tables to locate the dimm an error is
reported for. I would also like to keep the use of smbios information
to a minimum, but we have to rely on correct firmware tables here.
A driver cannot work correctly on top of broken firmware.
Why not just blame the firmware if something is wrong? I am sure it
will be corrected if edac does not work properly.

>
>
> > A fallback that disables numa is implemented in case the node
> > information is inconsistent.
>
> > diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> > index 50f4ee36b755..083452a48b42 100644
> > --- a/drivers/edac/ghes_edac.c
> > +++ b/drivers/edac/ghes_edac.c
> > @@ -67,14 +67,34 @@ struct memdev_dmi_entry {
> > u16 conf_mem_clk_speed;
> > } __attribute__((__packed__));
> >
> > +/* Memory Array Mapped Address - Type 19 of SMBIOS spec */
> > +struct memarr_dmi_entry {
> > +	u8 type;
> > +	u8 length;
> > +	u16 handle;
> > +	u32 start;
> > +	u32 end;
> > +	u16 phys_mem_array_handle;
> > +	u8 partition_width;
> > +	u64 ext_start;
> > +	u64 ext_end;
> > +} __attribute__((__packed__));
>
> Any chance we could collect the structures from the smbios spec in a header file somewhere?

I could create a new ghes_edac.h file, but the only user would be
ghes_edac.c. That does not make sense to me.

>
> I'd prefer not to read this thing at all if we can help it.

I don't see how else we identify the dimm other than the phys addr
range and the smbios handle?

>
> > struct ghes_dimm_info {
> > struct dimm_info dimm_info;
> > int idx;
> > + int numa_node;
>
> (I thought nid was the preferred term)

struct device uses numa_node here, so I chose this one.

>
>
> > + phys_addr_t start;
> > + phys_addr_t end;
>
> I think start and end are deceptive as they overlap with other DIMMs on systems with
> interleaving memory controllers. I'd prefer not to keep these values around.

The (start) address is only used for dimm/node mapping.

>
>
> > + u16 phys_handle;
> > };
> >
> > struct ghes_mem_info {
> > - int num_dimm;
> > + int num_dimm;
> > struct ghes_dimm_info *dimms;
> > + int num_nodes;
>
> > + int num_per_node[MAX_NUMNODES];
>
> Number of what?

dimms_per_node, will change.

>
>
> > + bool enable_numa;
>
> This is used locally in mem_info_setup(), but its not read from here by any of the later
> patches in the series. Is it needed?

No, not really, will remove it.

>
> I don't like the idea that this is behaviour that is turned on/off. Its a property of the
> system. I think it would be better if CONFIG_NUMA causes you to get multiple
> memory-controllers created, but if its not actually a NUMA machine there would only be
> one. This lets us test that code on not-really-numa systems.

There is only one node if CONFIG_NUMA is disabled and only one mc is
created.

We disable per-node memory controllers only if the node id cannot be
determined properly for some reason.
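
For illustration, a minimal userspace sketch of that fallback, loosely
modelled on mem_info_setup() and mem_info_disable_numa() from the patch.
The names struct dimm_sketch, struct mem_sketch, SKETCH_MAX_NODES and
SKETCH_NO_NODE are made up for this example and stand in for the kernel
types and for MAX_NUMNODES and NUMA_NO_NODE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_NODES	4	/* stand-in for MAX_NUMNODES */
#define SKETCH_NO_NODE		(-1)	/* stand-in for NUMA_NO_NODE */

struct dimm_sketch {
	int numa_node;
};

struct mem_sketch {
	int num_nodes;
	int dimms_per_node[SKETCH_MAX_NODES];
};

/*
 * Count dimms per node. If any dimm has no usable node id, fall back
 * to a single memory controller: all dimms are moved to index 0.
 * Returns true if per-node accounting is usable, false on fallback.
 */
static bool assign_nodes(struct dimm_sketch *dimms, size_t n,
			 struct mem_sketch *mem)
{
	bool per_node = true;
	size_t i;
	int nid;

	mem->num_nodes = 0;
	for (nid = 0; nid < SKETCH_MAX_NODES; nid++)
		mem->dimms_per_node[nid] = 0;

	for (i = 0; i < n; i++) {
		nid = dimms[i].numa_node;
		if (nid < 0 || nid >= SKETCH_MAX_NODES) {
			per_node = false;
			continue;
		}
		if (!mem->dimms_per_node[nid])
			mem->num_nodes++;
		mem->dimms_per_node[nid]++;
	}

	if (per_node)
		return true;

	/* Something went wrong: one controller at index 0 owns all dimms. */
	for (nid = 0; nid < SKETCH_MAX_NODES; nid++)
		mem->dimms_per_node[nid] = 0;
	for (i = 0; i < n; i++)
		dimms[i].numa_node = 0;
	mem->dimms_per_node[0] = (int)n;
	mem->num_nodes = 1;
	return false;
}
```

A single dimm without a node id is enough to collapse everything onto
index 0, which matches the all-or-nothing behaviour described above.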

>
>
> > };
> >
> > struct ghes_mem_info mem_info;
> > @@ -97,10 +117,50 @@ static void ghes_dimm_info_init(void)
> >
> > for_each_dimm(dimm) {
> > dimm->idx = idx;
> > + dimm->numa_node = NUMA_NO_NODE;
> > idx++;
> > }
> > }
> >
> > +static void ghes_edac_set_nid(const struct dmi_header *dh, void *arg)
> > +{
> > +	struct memarr_dmi_entry *entry = (struct memarr_dmi_entry *)dh;
> > +	struct ghes_dimm_info *dimm;
> > +	phys_addr_t start, end;
> > +	int nid;
> > +
> > +	if (dh->type != DMI_ENTRY_MEM_ARRAY_MAPPED_ADDR)
> > +		return;
>
> > + /* only support SMBIOS 2.7+ */
> > + if (entry->length < sizeof(*entry))
> > + return;
>
> Lovely. I still hope we can get away without parsing this table.
>
>
> > +	if (entry->start == 0xffffffff)
> > +		start = entry->ext_start;
> > +	else
> > +		start = entry->start;
> > +	if (entry->end == 0xffffffff)
> > +		end = entry->ext_end;
> > +	else
> > +		end = entry->end;
>
>
> > + if (!pfn_valid(PHYS_PFN(start)))
> > + return;
>
> Eh? Just because there is no struct page doesn't mean firmware won't report errors for it.
> This is going to bite on arm64 if the 'start' page happens to have been reserved by
> firmware, and thus doesn't have a struct page. Bottom-up allocation doesn't sound unlikely.

It looks like the memblock areas have a finer granularity than the
memory ranges in the DMI and SRAT table. DMI and SRAT have the same
areas on the system that I have used for testing.

The SRAT ranges are used to set up the pfns. If they match the dmi table
and the memblocks are within those ranges, I don't see an issue.

The fallback would be that the node is not detectable and per-node mc
allocation is disabled.

----
# dmidecode | grep -A 5 'Memory Array Mapped Address'
Memory Array Mapped Address
Starting Address: 0x0000000080000000k
Ending Address: 0x00000000FEFFFFFFk
Range Size: 2032 MB
Physical Array Handle: 0x0037
Partition Width: 1
--
Memory Array Mapped Address
Starting Address: 0x0000000880000000k
Ending Address: 0x0000000FFFFFFFFFk
Range Size: 30 GB
Physical Array Handle: 0x0037
Partition Width: 1
--
Memory Array Mapped Address
Starting Address: 0x0000008800000000k
Ending Address: 0x0000009FFCFFFFFFk
Range Size: 98256 MB
Physical Array Handle: 0x0037
Partition Width: 1
--
Memory Array Mapped Address
Starting Address: 0x0000009FFD000000k
Ending Address: 0x000000BFFCFFFFFFk
Range Size: 128 GB
Physical Array Handle: 0x004E
Partition Width: 1
# dmesg | grep SRAT:.*mem
[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x80000000-0xfeffffff]
[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x880000000-0xfffffffff]
[ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x8800000000-0x9ffcffffff]
[ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x9ffd000000-0xbffcffffff]
# dmesg
[...]
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00000000802f0000-0x000000008030ffff]
[ 0.000000] node 0: [mem 0x0000000080310000-0x00000000bfffffff]
[ 0.000000] node 0: [mem 0x00000000c0000000-0x00000000c0ccffff]
[ 0.000000] node 0: [mem 0x00000000c0cd0000-0x00000000f95effff]
[ 0.000000] node 0: [mem 0x00000000f95f0000-0x00000000f961ffff]
[ 0.000000] node 0: [mem 0x00000000f9620000-0x00000000fac3ffff]
[ 0.000000] node 0: [mem 0x00000000fac40000-0x00000000faddffff]
[ 0.000000] node 0: [mem 0x00000000fade0000-0x00000000fc8dffff]
[ 0.000000] node 0: [mem 0x00000000fc8e0000-0x00000000fc8effff]
[ 0.000000] node 0: [mem 0x00000000fc8f0000-0x00000000fcaaffff]
[ 0.000000] node 0: [mem 0x00000000fcab0000-0x00000000fcacffff]
[ 0.000000] node 0: [mem 0x00000000fcad0000-0x00000000fcb4ffff]
[ 0.000000] node 0: [mem 0x00000000fcb50000-0x00000000fd1fffff]
[ 0.000000] node 0: [mem 0x00000000fd200000-0x00000000fecfffff]
[ 0.000000] node 0: [mem 0x00000000fed00000-0x00000000fed2ffff]
[ 0.000000] node 0: [mem 0x00000000fed30000-0x00000000fed3ffff]
[ 0.000000] node 0: [mem 0x00000000fed40000-0x00000000fedeffff]
[ 0.000000] node 0: [mem 0x00000000fedf0000-0x00000000feffffff]
[ 0.000000] node 0: [mem 0x0000000880000000-0x0000000fffffffff]
[ 0.000000] node 0: [mem 0x0000008800000000-0x0000009ffcffffff]
[ 0.000000] node 1: [mem 0x0000009ffd000000-0x000000bffcffffff]
[...]
----
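
As a sketch of what "the node is known from the address" means for the
ranges above: the following userspace example hardcodes the four SRAT
ranges from this dmesg output and looks up the node for a physical
address. It is not the kernel's pfn_to_nid() and does not go via struct
page. The names struct node_range, srat_ranges, addr_to_node() and
NO_NODE are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_NODE	(-1)	/* stand-in for NUMA_NO_NODE */

struct node_range {
	uint64_t start;
	uint64_t end;	/* inclusive */
	int nid;
};

/* The four SRAT memory ranges from the dmesg output above. */
static const struct node_range srat_ranges[] = {
	{ 0x80000000ULL,   0xfeffffffULL,   0 },
	{ 0x880000000ULL,  0xfffffffffULL,  0 },
	{ 0x8800000000ULL, 0x9ffcffffffULL, 0 },
	{ 0x9ffd000000ULL, 0xbffcffffffULL, 1 },
};

/* Return the node id covering the address, or NO_NODE if none does. */
static int addr_to_node(uint64_t pa)
{
	size_t i;

	for (i = 0; i < sizeof(srat_ranges) / sizeof(srat_ranges[0]); i++) {
		if (pa >= srat_ranges[i].start && pa <= srat_ranges[i].end)
			return srat_ranges[i].nid;
	}
	return NO_NODE;
}
```

Note that 0x80000000 resolves to node 0 here although the first early
memory range only starts at 0x802f0000. A lookup that depends on a
valid struct page for that start address could behave differently.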

>
>
> > + nid = pfn_to_nid(PHYS_PFN(start));
>
> ... Ugh, because pfn_to_nid() goes via struct page.
>
> You can make this robust by scanning start->end looking for a pfn_valid() you can pull the
> nid out of. (no, I don't think its a good idea either!)
>
> I'd like to see if we can get rid of the 'via address' part of this.
>
>
> > + if (nid < 0 || nid >= MAX_NUMNODES || !node_possible(nid))
> > + nid = NUMA_NO_NODE;
>
> Can this happen? Does this indicate the firmware tables are wrong, or mm is about derail?

It's a range check. pfn_to_nid() is implementation defined, so this
just makes sure the result is as expected.

>
>
> > +	for_each_dimm(dimm) {
> > +		if (entry->phys_mem_array_handle == dimm->phys_handle) {
> > +			dimm->numa_node = nid;
> > +			dimm->start = start;
> > +			dimm->end = end;
> > +		}
> > +	}
> > +}
> > +
> > static int get_dimm_smbios_index(u16 handle)
> > {
> > struct mem_ctl_info *mci = ghes_pvt->mci;
> > @@ -213,8 +273,25 @@ static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
> > }
> > }
> >
> > +static void mem_info_disable_numa(void)
> > +{
> > + struct ghes_dimm_info *dimm;
> > +
> > + for_each_dimm(dimm) {
> > + if (dimm->numa_node != NUMA_NO_NODE)
> > + mem_info.num_per_node[dimm->numa_node] = 0;
>
> > + dimm->numa_node = 0;
>
> NUMA_NO_NODE?

No, this is the index to the one and only mem controller that we have
with numa disabled for edac.

>
> > + }
> > +
> > + mem_info.num_per_node[0] = mem_info.num_dimm;
> > + mem_info.num_nodes = 1;
> > + mem_info.enable_numa = false;
> > +}
> > +
> > static int mem_info_setup(void)
> > {
> > + struct ghes_dimm_info *dimm;
> > + bool enable_numa = true;
> > int idx = 0;
> >
> > memset(&mem_info, 0, sizeof(mem_info));
> > @@ -231,6 +308,29 @@ static int mem_info_setup(void)
> >
> > ghes_dimm_info_init();
> > dmi_walk(ghes_edac_dmidecode, &idx);
> > + dmi_walk(ghes_edac_set_nid, NULL);
> > +
> > + for_each_dimm(dimm) {
> > + if (dimm->numa_node == NUMA_NO_NODE) {
> > + enable_numa = false;
> > + } else {
>
> > + if (!mem_info.num_per_node[dimm->numa_node])
> > + mem_info.num_nodes++;
>
> This is to try and hide empty nodes?

This is consumed nowhere and can be removed.

>
>
> > + mem_info.num_per_node[dimm->numa_node]++;
>
> Could you do these two in your previous for_each_dimm() walk?

This must be called after the ghes_edac_set_nid walker.

>
>
> > + }
> > +
> > + edac_dbg(1, "DIMM%i: Found mem range [%pa-%pa] on node %d\n",
> > + dimm->idx, &dimm->start, &dimm->end, dimm->numa_node);
> > + }
>
>
> > + mem_info.enable_numa = enable_numa;
> > + if (enable_numa)
> > + return 0;
> > +
> > + /* something went wrong, disable numa */
> > + if (num_possible_nodes() > 1)
> > + pr_warn("Can't get numa info, disabling numa\n");
> > + mem_info_disable_numa();
>
> I'd like to find a way of doing this where we don't need this sort of thing!

I fear that might not be possible and you can't have one without the
other. You need the tables to set up the dimms and you need the smbios
handle to map the error to the dimm.

-Robert

>
>
> Thanks,
>
> James
