Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

From: Yunsheng Lin
Date: Mon Oct 14 2019 - 04:00:55 EST

On 2019/10/12 18:47, Greg KH wrote:
> On Sat, Oct 12, 2019 at 12:40:01PM +0200, Greg KH wrote:
>> On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
>>> On 2019/10/12 15:40, Greg KH wrote:
>>>> On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
>>>>> add pci and acpi maintainer
>>>>> cc linux-pci@xxxxxxxxxxxxxxx and linux-acpi@xxxxxxxxxxxxxxx
>>>>> On 2019/10/11 19:15, Peter Zijlstra wrote:
>>>>>> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
>>>>>>> But I failed to see why the above is related to making node_to_cpumask_map()
>>>>>>> NUMA_NO_NODE aware?
>>>>>> Your initial bug is for hns3, which is a PCI device, which really _MUST_
>>>>>> have a node assigned.
>>>>>> It not having one, is a straight up bug. We must not silently accept
>>>>>> NO_NODE there, ever.
>>>>> I suppose you mean reporting a lack of affinity when the node of a pcie
>>>>> device is not set by "not silently accept NO_NODE".
>>>> If the firmware of a pci device does not provide the node information,
>>>> then yes, warn about that.
>>>>> As Greg has asked about in [1]:
>>>>> what is a user to do when the user sees the kernel reporting that?
>>>>> We may tell user to contact their vendor for info or updates about
>>>>> that when they do not know about their system well enough, but their
>>>>> vendor may get away with this by quoting ACPI spec as the spec
>>>>> considering this optional. Should the user believe this is indeed a
>>>>> fw bug or a misreport from the kernel?
>>>> Say it is a firmware bug, if it is a firmware bug, that's simple.
>>>>> If this kind of reporting is common practice and will not cause any
>>>>> misunderstanding, then maybe we can report that.
>>>> Yes, please do so, that's the only way those boxes are ever going to get
>>>> fixed. And go add the test to the "firmware testing" tool that is based
>>>> on Linux that Intel has somewhere, to give vendors a chance to fix this
>>>> before they ship hardware.
>>>> This shouldn't be a big deal, we warn of other hardware bugs all the
>>>> time.
>>> Ok, thanks for clarifying.
>>> Will send a patch to catch the case where a pcie device has no numa node
>>> set, and warn about it.
>>> Maybe use dev->bus to verify if it is a pci device?
>> No, do that in the pci bus core code itself, when creating the devices
>> as that is when you know, or do not know, the numa node, right?
>> This can't be in the driver core only, as each bus type will have a
>> different way of determining what the node the device is on. For some
>> reason, I thought the PCI core code already does this, right?
> Yes, pci_irq_get_node(), which NO ONE CALLS! I should go delete that
> thing...
> Anyway, it looks like the pci core code does call set_dev_node() based
> on the PCI bridge, so if that is set up properly, all should be fine.
> If not, well, you have buggy firmware and you need to warn about that at
> the time you are creating the bridge. Look at the call to
> pcibus_to_node() in pci_register_host_bridge().

Thanks for pointing out the specific function.
Maybe we do not need to warn when the device has a parent: if the parent
also has a node of NO_NODE, we must already have warned about the parent,
so there is no need to warn about the child device again. Something like
below:

@@ -932,6 +932,10 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
 	list_add_tail(&bus->node, &pci_root_buses);

+	if (nr_node_ids > 1 && !parent &&
+	    dev_to_node(bus->bridge) == NUMA_NO_NODE)
+		dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW. Please contact your vendor for updates.\n");
 	return 0;

Also, we do not need to warn about that in pci_device_add(), right?
We must already have warned about the pci host bridge of the pci device.

I may be wrong about the above, because I am not very familiar with the
pci code.
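
To make the condition in the hunk above easier to reason about, here is a
minimal userspace sketch of the same check. It is only a model, not kernel
code: struct model_dev and model_should_warn() are hypothetical names made
up for illustration, and nr_node_ids is passed in as a plain parameter
instead of being the kernel global. NUMA_NO_NODE is defined as (-1), as in
the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)

/* Hypothetical stand-in for the parts of struct device used by the check. */
struct model_dev {
	const struct model_dev *parent;	/* NULL for a top-level host bridge */
	int numa_node;			/* NUMA_NO_NODE when firmware gave none */
};

/*
 * Mirrors the proposed condition: warn only on a NUMA capable system
 * (more than one node id), only for a bridge without a parent, and only
 * when no node was assigned to it.
 */
static bool model_should_warn(const struct model_dev *bridge,
			      unsigned int nr_node_ids)
{
	assert(bridge != NULL);

	return nr_node_ids > 1 &&
	       bridge->parent == NULL &&
	       bridge->numa_node == NUMA_NO_NODE;
}
```

With this model, a parentless bridge with NUMA_NO_NODE on a two-node system
triggers the warning, while the same bridge on a single-node system, a
bridge that has a parent, or a bridge with node 0 assigned does not.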

> And yes, you need to do this all on a per-bus-type basis, as has been
> pointed out. It's up to the bus to create the device and set this up
> properly.

Will do that on a per-bus-type basis.

> thanks,
> greg k-h