RE: [PATCH v2] device-dax: use fallback nid when numa node is invalid

From: Justin He
Date: Wed Sep 15 2021 - 02:51:15 EST




> -----Original Message-----
> From: Dan Williams <dan.j.williams@xxxxxxxxx>
> Sent: Wednesday, September 15, 2021 1:16 PM
> To: Justin He <Justin.He@xxxxxxx>
> Cc: Vishal Verma <vishal.l.verma@xxxxxxxxx>; Dave Jiang
> <dave.jiang@xxxxxxxxx>; David Hildenbrand <david@xxxxxxxxxx>; Linux NVDIMM
> <nvdimm@xxxxxxxxxxxxxxx>; Linux Kernel Mailing List
> <linux-kernel@xxxxxxxxxxxxxxx>; nd <nd@xxxxxxx>
> Subject: Re: [PATCH v2] device-dax: use fallback nid when numa node is
> invalid
>
> On Mon, Sep 13, 2021 at 7:06 PM Justin He <Justin.He@xxxxxxx> wrote:
> >
> > Hi Dan,
> >
> > > -----Original Message-----
> > > From: Dan Williams <dan.j.williams@xxxxxxxxx>
> > > Sent: Friday, September 10, 2021 11:42 PM
> > > To: Justin He <Justin.He@xxxxxxx>
> > > Cc: Vishal Verma <vishal.l.verma@xxxxxxxxx>; Dave Jiang
> > > <dave.jiang@xxxxxxxxx>; David Hildenbrand <david@xxxxxxxxxx>; Linux NVDIMM
> > > <nvdimm@xxxxxxxxxxxxxxx>; Linux Kernel Mailing List
> > > <linux-kernel@xxxxxxxxxxxxxxx>
> > > Subject: Re: [PATCH v2] device-dax: use fallback nid when numa node is
> > > invalid
> > >
> > > On Fri, Sep 10, 2021 at 5:46 AM Jia He <justin.he@xxxxxxx> wrote:
> > > >
> > > > Previously, numa_off was set unconditionally in dummy_numa_init()
> > > > even with a fake numa node. Then ACPI sets node id as NUMA_NO_NODE(-1)
> > > > after acpi_map_pxm_to_node() because it regards numa_off as turning
> > > > off the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on
> > > > arm64 with fake numa case.
> > > >
> > > > Without this patch, pmem can't be probed as RAM devices on arm64 if
> > > > SRAT table isn't present:
> > > > $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
> > > > kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
> > > > kmem: probe of dax0.0 failed with error -22
> > > >
> > > > This fixes it by using fallback memory_add_physaddr_to_nid() as nid.
> > > >
> > > > Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> > > > Signed-off-by: Jia He <justin.he@xxxxxxx>
> > > > ---
> > > > v2: - rebase it based on David's "memory group" patch.
> > > > - drop the changes in dev_dax_kmem_remove() since nid had been
> > > > removed in remove_memory().
> > > > drivers/dax/kmem.c | 31 +++++++++++++++++--------------
> > > > 1 file changed, 17 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> > > > index a37622060fff..e4836eb7539e 100644
> > > > --- a/drivers/dax/kmem.c
> > > > +++ b/drivers/dax/kmem.c
> > > > @@ -47,20 +47,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> > > > unsigned long total_len = 0;
> > > > struct dax_kmem_data *data;
> > > > int i, rc, mapped = 0;
> > > > - int numa_node;
> > > > -
> > > > - /*
> > > > - * Ensure good NUMA information for the persistent memory.
> > > > - * Without this check, there is a risk that slow memory
> > > > - * could be mixed in a node with faster memory, causing
> > > > - * unavoidable performance issues.
> > > > - */
> > > > - numa_node = dev_dax->target_node;
> > > > - if (numa_node < 0) {
> > > > - dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
> > > > - numa_node);
> > > > - return -EINVAL;
> > > > - }
> > > > + int numa_node = dev_dax->target_node;
> > > >
> > > > for (i = 0; i < dev_dax->nr_range; i++) {
> > > > struct range range;
> > > > @@ -71,6 +58,22 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> > > > i, range.start, range.end);
> > > > continue;
> > > > }
> > > > +
> > > > + /*
> > > > + * Ensure good NUMA information for the persistent memory.
> > > > + * Without this check, there is a risk but not fatal that slow
> > > > + * memory could be mixed in a node with faster memory, causing
> > > > + * unavoidable performance issues. Warn this and use fallback
> > > > + * node id.
> > > > + */
> > > > + if (numa_node < 0) {
> > > > + int new_node = memory_add_physaddr_to_nid(range.start);
> > > > +
> > > > + dev_info(dev, "changing nid from %d to %d for DAX region [%#llx-%#llx]\n",
> > > > + numa_node, new_node, range.start, range.end);
> > > > + numa_node = new_node;
> > > > + }
> > > > +
> > > > total_len += range_len(&range);
> > >
> > > This fallback change belongs where the parent region for the namespace
> > > adopts its target_node, because it's not clear
> > > memory_add_physaddr_to_nid() is the right fallback in all situations.
> > > Here is where this setting is happening currently:
> > >
> > > drivers/acpi/nfit/core.c:3004: ndr_desc->target_node =
> > > pxm_to_node(spa->proximity_domain);
> > On my local arm64 guest('virt' machine type), the target_node is
> > set to -1 at this line.
> > That is:
> > The condition "spa->flags & ACPI_NFIT_PROXIMITY_VALID" is hit.
> >
> > > drivers/acpi/nfit/core.c:3007: ndr_desc->target_node =
> > > NUMA_NO_NODE;
> > > drivers/nvdimm/e820.c:29: ndr_desc.target_node = nid;
> > > drivers/nvdimm/of_pmem.c:58: ndr_desc.target_node =
> > > ndr_desc.numa_node;
> > > drivers/nvdimm/region_devs.c:1127: nd_region->target_node =
> > > ndr_desc->target_node;
> >
> >
> > Sorry, Dan. I thought I missed your previous mail:
> >
> > =========================================
> > Looks like it is the NFIT driver, thanks.
> >
> > If you're getting NUMA_NO_NODE in dax_kmem from the NFIT driver it
> > means your ACPI NFIT table is failing to populate correct numa
> > information. You could try the following to fix it up, but I think the
> > real problem is that your platform BIOS needs to add the proper numa
> > data.
> >
> > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> > index fb775b967c52..d3a0cec635b1 100644
> > --- a/drivers/acpi/nfit/core.c
> > +++ b/drivers/acpi/nfit/core.c
> > @@ -3005,15 +3005,8 @@ static int acpi_nfit_register_region(struct
> > acpi_nfit_desc *acpi_desc,
> > ndr_desc->res = &res;
> > ndr_desc->provider_data = nfit_spa;
> > ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
> > - if (spa->flags & ACPI_NFIT_PROXIMITY_VALID) {
> > - ndr_desc->numa_node = acpi_map_pxm_to_online_node(
> > - spa->proximity_domain);
> > - ndr_desc->target_node = acpi_map_pxm_to_node(
> > - spa->proximity_domain);
> > - } else {
> > - ndr_desc->numa_node = NUMA_NO_NODE;
> > - ndr_desc->target_node = NUMA_NO_NODE;
> > - }
> > + ndr_desc->numa_node = memory_add_physaddr_to_nid(spa->address);
> > + ndr_desc->target_node = phys_to_target_node(spa->address);
> >
> > /*
> > * Persistence domain bits are hierarchical, if
> > ===================================================
> >
> > Do you still suggest fixing like this?
>
> Are you saying that ACPI_NFIT_PROXIMITY_VALID is not set on your
> platform, or that pxm_to_node() returns NUMA_NO_NODE?
>
The latter: ACPI_NFIT_PROXIMITY_VALID is *set* in my case, and
pxm_to_node() returns NUMA_NO_NODE.
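
To spell out the path I am hitting, here is a small userspace model (not
kernel code). The names model_numa_off, model_pxm_to_node(),
model_target_node() and MODEL_MAX_PXM_DOMAINS are hypothetical stand-ins.
The assumption, taken from the patch description quoted above, is that the
pxm lookup yields NUMA_NO_NODE whenever numa_off is set, even though the
proximity-valid flag is present:

```c
#include <stdbool.h>

#define NUMA_NO_NODE (-1)
#define MODEL_MAX_PXM_DOMAINS 8	/* hypothetical table size for the model */

/* Hypothetical stand-in for the kernel's global numa_off flag. */
static bool model_numa_off;

/*
 * Hypothetical pxm -> node lookup. The model uses an identity mapping and
 * assumes the lookup gives up as soon as numa_off is set.
 */
static int model_pxm_to_node(int pxm)
{
	if (pxm < 0 || pxm >= MODEL_MAX_PXM_DOMAINS || model_numa_off)
		return NUMA_NO_NODE;
	return pxm;
}

/* Models the two branches that set target_node for an NFIT region. */
static int model_target_node(bool proximity_valid, int proximity_domain)
{
	if (proximity_valid)
		return model_pxm_to_node(proximity_domain);
	return NUMA_NO_NODE;
}
```

With model_numa_off set, model_target_node(true, 0) gives NUMA_NO_NODE,
which matches what I see on the 'virt' guest.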

> I would expect something like this:
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index a3ef6cce644c..95de7dc18ed8 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -3007,6 +3007,15 @@ static int acpi_nfit_register_region(struct
> acpi_nfit_desc *acpi_desc,
> ndr_desc->target_node = NUMA_NO_NODE;
> }
>
> + /*
> + * Fallback to address based numa information if node lookup
> + * failed
> + */
> + if (ndr_desc->numa_node == NUMA_NO_NODE)
> + ndr_desc->numa_node = memory_add_physaddr_to_nid(spa->address);
> + if (ndr_desc->target_node == NUMA_NO_NODE)
> + ndr_desc->target_node = phys_to_target_node(spa->address);
> +

Would it be better to add a dev_info() here to report the node id change?
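
Roughly what I have in mind, as a userspace model only (not a patch). The
struct and the model_* helpers are hypothetical stand-ins: the stubs map
every address to node 0, and printf() stands in for dev_info(). The real
helpers, memory_add_physaddr_to_nid() and phys_to_target_node(), are arch
specific and may return something else:

```c
#include <stdio.h>

#define NUMA_NO_NODE (-1)

/* Hypothetical, cut-down stand-in for the region descriptor. */
struct model_region_desc {
	int numa_node;
	int target_node;
};

/* Hypothetical stub: the model maps every physical address to node 0. */
static int model_physaddr_to_nid(unsigned long long addr)
{
	(void)addr;
	return 0;
}

/* Hypothetical stub: the model maps every physical address to node 0. */
static int model_phys_to_target_node(unsigned long long addr)
{
	(void)addr;
	return 0;
}

/*
 * Fall back to address based node information when the lookup failed,
 * and report each change. Returns the number of fields that were changed.
 */
static int model_fixup_nodes(struct model_region_desc *desc,
			     unsigned long long addr)
{
	int changed = 0;

	if (desc->numa_node == NUMA_NO_NODE) {
		desc->numa_node = model_physaddr_to_nid(addr);
		printf("changing numa node from %d to %d for region at %#llx\n",
		       NUMA_NO_NODE, desc->numa_node, addr);
		changed++;
	}
	if (desc->target_node == NUMA_NO_NODE) {
		desc->target_node = model_phys_to_target_node(addr);
		printf("changing target node from %d to %d for region at %#llx\n",
		       NUMA_NO_NODE, desc->target_node, addr);
		changed++;
	}
	return changed;
}
```

The point of the message is only that the user can see the node id was
not provided by firmware but derived from the address.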

--
Cheers,
Justin (Jia He)