Re: [PATCH] PCI: Warn about host bridge device when its numa node is NO_NODE

From: Yunsheng Lin
Date: Fri Oct 25 2019 - 04:51:44 EST

On 2019/10/25 16:16, Michal Hocko wrote:
> On Thu 24-10-19 12:40:13, Bjorn Helgaas wrote:
>> On Thu, Oct 24, 2019 at 11:20:13AM +0200, Michal Hocko wrote:
>>> On Wed 23-10-19 12:10:39, Bjorn Helgaas wrote:
>>>> On Wed, Oct 23, 2019 at 04:22:43PM +0800, Yunsheng Lin wrote:
>>>>> On 2019/10/23 5:04, Bjorn Helgaas wrote:
>>>>>> On Sat, Oct 19, 2019 at 02:45:43PM +0800, Yunsheng Lin wrote:
>>>>>> I think the underlying problem you're addressing is that:
>>>>>> - NUMA_NO_NODE == -1,
>>>>>> - dev_to_node(dev) may return NUMA_NO_NODE,
>>>>>> - kmalloc(dev) relies on cpumask_of_node(dev_to_node(dev)), and
>>>>>> - cpumask_of_node(NUMA_NO_NODE) makes an invalid array reference
>>>>>> For example, on arm64, mips loongson, s390, and x86,
>>>>>> cpumask_of_node(node) returns "node_to_cpumask_map[node]", and -1 is
>>>>>> an invalid array index.
>>>>> The invalid array index of -1 is the underlying problem here when
>>>>> cpumask_of_node(dev_to_node(dev)) is called and cpumask_of_node()
>>>>> is not NUMA_NO_NODE aware yet.
>>>>> In the "numa: make node_to_cpumask_map() NUMA_NO_NODE aware" thread
>>>>> disscusion, it is requested that it is better to warn about the pcie
>>>>> device without a node assigned by the firmware before making the
>>>>> cpumask_of_node() NUMA_NO_NODE aware, so that the system with pci
>>>>> devices of "NUMA_NO_NODE" node can be fixed by their vendor.
>>>>> See:
>>>> Right. We should warn if the NUMA node number would help us but DT or
>>>> the firmware didn't give us one.
>>>> But we can do that independently of any cpumask_of_node() changes.
>>>> There's no need to do one patch before the other. Even if you make
>>>> cpumask_of_node() tolerate NUMA_NO_NODE, we'll still get the warning
>>>> because we're not actually changing any node assignments.
>>> Agreed. And this has been proposed previously I believe but my
>>> understanding was that Petr was against that.
>> I don't think Peter was opposed to a warning.
> Now, he was opposed to cpumask_of_node handling IIRC.

That was my understanding too.

But I am not sure if Peter is still opposed to cpumask_of_node() after
the pcie device without node affinity is warned in this patch.

>From previous discussion [1]:

>Yunsheng> But I failed to see why the above is related to making node_to_cpumask_map()
>Yunsheng> NUMA_NO_NODE aware?

Peter> Your initial bug is for hns3, which is a PCI device, which really _MUST_
Peter> have a node assigned.

Peter> It not having one, is a straight up bug. We must not silently accept
Peter> NO_NODE there, ever."

I think Peter is not opposed to cpumask_of_node() after this patch if I
understand it correctly.


>> I assume you're
>> referring to [1], which is about how cpumask_of_node() should work.
>> That's not my area, so I don't have a strong opinion. From that
>> discussion:
>> Yunsheng> From what I can see, the problem can be fixed in three
>> Yunsheng> place:
>> Yunsheng> 1. Make user dev_to_node return a valid node id
>> Yunsheng> even when proximity domain is not set by bios(or node id
>> Yunsheng> set by buggy bios is not valid), which may need info from
>> Yunsheng> the numa system to make sure it will return a valid node.
>> Yunsheng>
>> Yunsheng> 2. User that call cpumask_of_node should ensure the node
>> Yunsheng> id is valid before calling cpumask_of_node, and user also
>> Yunsheng> need some info to make ensure node id is valid.
>> Yunsheng>
>> Yunsheng> 3. Make sure cpumask_of_node deal with invalid node id as
>> Yunsheng> this patchset.
>> Peter> 1) because even it is not set, the device really does belong
>> Peter> to a node. It is impossible a device will have magic
>> Peter> uniform access to memory when CPUs cannot.
>> Peter>
>> Peter> 2) is already true today, cpumask_of_node() requires a valid
>> Peter> node_id.
>> Peter>
>> Peter> 3) is just wrong and increases overhead for everyone.
>> I think Peter is advocating (1), i.e., if firmware doesn't tell us a
>> node ID, we just pick one. We can certainly do that in *addition* to
>> adding a warning. I'd like it to be a separate patch because I think
>> we want the warning no matter what so users have some clue that
>> performance could be better.
> Yes, those should be two different patches.
>> If we pick one, I guess we either assign some default, like node 0, to
>> everything; or we somehow pick a random node to assign.
> I have tried to explain that picking a default node is tricky because
> node 0 is not generally available and you never know when a node might
> just disappear if the device is not bound to it.

Maybe the example you already mentioned in previous discussion may help

3e8589963773 ("memcg: make it work on sparse non-0-node systems")

> I really do not see why the proposed online_cpu_mask for that case is
> such a big deal. It will likely lead to suboptimal performance but what
> do you expect from a suboptimal HW description. There is no question
> that the proper node affinity should be set and the warning might really
> help here but trying to be clever and find a replacement sounds like
> potential for more subtly broken results than doing a straightforward
> thing.
> But I will just go silent now because I am just repeating myself.