RE: [PATCH] PCI: hv: Fix NUMA node assignment when kernel boots with parameters affecting NUMA topology

From: Michael Kelley (LINUX)
Date: Mon Jan 10 2022 - 11:12:40 EST


From: Long Li <longli@xxxxxxxxxxxxx> Sent: Friday, January 7, 2022 12:32 PM
> >
> > From: longli@xxxxxxxxxxxxxxxxx <longli@xxxxxxxxxxxxxxxxx> Sent:
> > Thursday, January 6, 2022 3:20 PM
> > >
> > > When the kernel boots with parameters restricting the number of cpus
> > > or NUMA nodes, e.g. maxcpus=X or numa=off, the vPCI driver should only
> > > set the NUMA node to a value that is valid in the currently running kernel.
> > >
> > > Signed-off-by: Long Li <longli@xxxxxxxxxxxxx>
> > > ---
> > > drivers/pci/controller/pci-hyperv.c | 17 +++++++++++++++--
> > > 1 file changed, 15 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > > index fc1a29acadbb..8686343eff4c 100644
> > > --- a/drivers/pci/controller/pci-hyperv.c
> > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > @@ -1835,8 +1835,21 @@ static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
> > >  		if (!hv_dev)
> > >  			continue;
> > >
> > > -		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY)
> > > -			set_dev_node(&dev->dev, hv_dev->desc.virtual_numa_node);
> > > +		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY) {
> > > +			int cpu;
> > > +			bool found_node = false;
> > > +
> > > +			for_each_possible_cpu(cpu)
> > > +				if (cpu_to_node(cpu) ==
> > > +				    hv_dev->desc.virtual_numa_node) {
> > > +					found_node = true;
> > > +					break;
> > > +				}
> > > +
> > > +			if (found_node)
> > > +				set_dev_node(&dev->dev,
> > > +					     hv_dev->desc.virtual_numa_node);
> > > +		}
> >
> > I'm wondering about this approach vs. just comparing against nr_node_ids.
>
> I was trying to fix this by comparing with nr_node_ids. This worked for
> numa=off, but it didn't work with maxcpus=X.
>
> maxcpus=X is commonly used in kdump kernels. In this config, the memory
> system is initialized in a way that only the NUMA nodes within maxcpus are
> set up and can be used by the drivers.

In looking at a 5.16 kernel running in a Hyper-V VM on two NUMA
nodes, the number of NUMA nodes configured in the kernel is not
affected by maxcpus= on the kernel boot line. This VM has 48 vCPUs
and 2 NUMA nodes, and is Generation 2. Even with maxcpus=4 or
maxcpus=1, these lines are output during boot:

[ 0.238953] NODE_DATA(0) allocated [mem 0x7edffd5000-0x7edfffffff]
[ 0.241397] NODE_DATA(1) allocated [mem 0xfcdffd4000-0xfcdfffefff]

and

[ 0.280039] Initmem setup node 0 [mem 0x0000000000001000-0x0000007edfffffff]
[ 0.282869] Initmem setup node 1 [mem 0x0000007ee0000000-0x000000fcdfffffff]

It's perfectly legit to have a NUMA node with memory but no CPUs. The
memory assigned to the NUMA node is determined by the ACPI SRAT. So
I'm wondering what is causing the kdump issue you see. Or maybe the
behavior of older kernels is different.

>
> > Comparing against nr_node_ids would handle the case of numa=off on the
> > kernel boot line, or a kernel built with CONFIG_NUMA=n, or the use of
> > numa=fake. Your approach is also affected by which CPUs are online, since
> > cpu_to_node() references percpu data. It would seem to produce more
> > variable results since CPUs can go online and offline while the VM is running.
> > If a network VF device was removed and re-added, the results of your
> > algorithm could be different for the re-add, depending on which CPUs were
> > online at the time.
> >
> > My impression (which may be incorrect) is that the device numa_node is
> > primarily to allow the driver to allocate memory from the closest NUMA node,
> > and such memory allocations don't really need to be affected by which CPUs
> > are online.
>
> Yes, this is the reason I'm using for_each_possible_cpu(). Even if some CPUs
> are not online, the memory system is set up in a way that allows the driver to
> allocate memory on that NUMA node. The algorithm guarantees the value of
> NUMA node is valid when calling set_dev_node().
>

I'm thinking the code here should check against nr_node_ids, to catch the
numa=off or CONFIG_NUMA=n cases. Then it could use either node_online()
or numa_map_to_online_node(), but I'm still curious as to how we would
get an offline NUMA node given how Hyper-V normally sets up a VM.
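
To make the suggestion concrete, here is a rough userspace sketch of the
two-step check (range check against the node count, then an online check).
It is not a proposed patch: the sim_* names are hypothetical stand-ins for
nr_node_ids, node_online() and NUMA_NO_NODE, and the table of online nodes
is made up for illustration. In the driver the accepted value would be
passed to set_dev_node().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not the real kernel symbols. */
#define SIM_MAX_NODES	8
#define SIM_NO_NODE	(-1)	/* plays the role of NUMA_NO_NODE */

/* Number of node IDs the simulated kernel knows about (like nr_node_ids). */
static unsigned int sim_nr_node_ids = 2;

/* Which simulated nodes are online (like node_online()). */
static bool sim_node_online_map[SIM_MAX_NODES] = { true, true };

static bool sim_node_online(unsigned int node)
{
	return node < SIM_MAX_NODES && sim_node_online_map[node];
}

/*
 * Decide which node to record for a device, given the node reported by
 * the host. Returns SIM_NO_NODE when the value should not be used.
 */
static int sim_pick_dev_node(unsigned int virtual_numa_node)
{
	/* Step 1: reject node IDs this kernel does not have at all. */
	if (virtual_numa_node >= sim_nr_node_ids)
		return SIM_NO_NODE;

	/* Step 2: reject nodes that exist but are currently offline. */
	if (!sim_node_online(virtual_numa_node))
		return SIM_NO_NODE;

	return (int)virtual_numa_node;
}
```

With two online nodes, sim_pick_dev_node(1) returns 1. If sim_nr_node_ids
is set to 1 to mimic numa=off, the same call returns SIM_NO_NODE. The
result does not depend on which CPUs happen to be online.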

NUMA nodes only transition from online to offline if there are no CPUs
or memory assigned. That can happen if the CPUs are taken offline (or
never came online) and if the memory is hot-removed. We don't currently
support memory hot-remove in Hyper-V VMs, though there has been
some discussion about adding it. I'm not sure how that case is supposed
to be handled if the NUMA node is stashed in some device and gets used
during dma_alloc_coherent(), for example. That seems to be a general
Linux problem unless there's a mechanism for handling it that I haven't
noticed.

Michael