Re: [PATCH] driver core: ensure a device has valid node id in device_add()

From: Michal Hocko
Date: Wed Sep 11 2019 - 02:49:30 EST


On Wed 11-09-19 14:15:51, Yunsheng Lin wrote:
> On 2019/9/11 13:33, Michal Hocko wrote:
> > On Tue 10-09-19 14:53:39, Michal Hocko wrote:
> >> On Tue 10-09-19 20:47:40, Yunsheng Lin wrote:
> >>> On 2019/9/10 19:12, Greg KH wrote:
> >>>> On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote:
> >>>>> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote:
> >>>>>> On 2019/9/10 17:31, Greg KH wrote:
> >>>>>>> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote:
> >>>>>>>> On 2019/9/9 17:53, Greg KH wrote:
> >>>>>>>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote:
> >>>>>>>>>> Currently a device does not belong to any of the numa nodes
> >>>>>>>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
> >>>>>>>>>> specified by fw nor by virtual device layer and the device has
> >>>>>>>>>> no parent device.
> >>>>>>>>>
> >>>>>>>>> Is this really a problem?
> >>>>>>>>
> >>>>>>>> Not really.
> >>>>>>>> Someone need to guess the node id when it is not specified, right?
> >>>>>>>
> >>>>>>> No, why? Guessing guarantees you will get it wrong on some systems.
> >>>>>>>
> >>>>>>> Are you seeing real problems because the id is not being set? What
> >>>>>>> problem is this fixing that you can actually observe?
> >>>>>>
> >>>>>> When passing the return value of dev_to_node() to cpumask_of_node()
> >>>>>> without checking the node id if the node id is not valid, there is
> >>>>>> global-out-of-bounds detected by KASAN as below:
> >>>>>
> >>>>> OK, I seem to remember this being brought up already. And now when I
> >>>>> think about it, we really want to make cpumask_of_node NUMA_NO_NODE
> >>>>> aware. That means using the same trick the allocator does for this
> >>>>> special case.
> >>>>
> >>>> That seems reasonable to me, and much more "obvious" as to what is going
> >>>> on.
> >>>>
> >>>
> >>> Ok, thanks for the suggestion.
> >>>
> >>> For arm64 and x86, there are two versions of cpumask_of_node().
> >>>
> >>> when CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node()
> >>> in arch/x86/mm/numa.c is used, which does partial node id checking:
> >>>
> >>> const struct cpumask *cpumask_of_node(int node)
> >>> {
> >>> if (node >= nr_node_ids) {
> >>> printk(KERN_WARNING
> >>> "cpumask_of_node(%d): node > nr_node_ids(%u)\n",
> >>> node, nr_node_ids);
> >>> dump_stack();
> >>> return cpu_none_mask;
> >>> }
> >>> if (node_to_cpumask_map[node] == NULL) {
> >>> printk(KERN_WARNING
> >>> "cpumask_of_node(%d): no node_to_cpumask_map!\n",
> >>> node);
> >>> dump_stack();
> >>> return cpu_online_mask;
> >>> }
> >>> return node_to_cpumask_map[node];
> >>> }
> >>>
> >>> when CONFIG_DEBUG_PER_CPU_MAPS is undefined, the cpumask_of_node()
> >>> in arch/x86/include/asm/topology.h is used:
> >>>
> >>> static inline const struct cpumask *cpumask_of_node(int node)
> >>> {
> >>> return node_to_cpumask_map[node];
> >>> }
> >>
> >> I would simply go with. There shouldn't be any need for heavy weight
> >> checks that CONFIG_DEBUG_PER_CPU_MAPS has.
> >>
> >> static inline const struct cpumask *cpumask_of_node(int node)
> >> {
> >> /* A nice comment goes here */
> >> if (node == NUMA_NO_NODE)
>
> How about "(unsigned int)node >= nr_node_ids", this is suggested
> by Peter, it checks the case where the node id set by fw is bigger
> or equal than nr_node_ids, and still handle the < 0 case, which
> includes NUMA_NO_NODE.

Isn't that a plain bug? Is something like that really happening?

> Maybe define a macro like below to do that in order to do
> the node checking consistently through kernel:
>
> #define numa_node_valid(node) ((unsigned int)(node) < nr_node_ids)
>
>
> >> return node_to_cpumask_map[numa_mem_id()];
> >> return node_to_cpumask_map[node];
> >> }
> >
> > Sleeping over this and thinking more about the actual semantic the above
> > is wrong. We cannot really copy the page allocator logic. Why? Simply
> > because the page allocator doesn't enforce the near node affinity. It
> > just picks it up as a preferred node but then it is free to fallback to
> > any other numa node. This is not the case here and node_to_cpumask_map will
> > only restrict to the particular node's cpus which would have really non
> > deterministic behavior depending on where the code is executed. So in
> > fact we really want to return cpu_online_mask for NUMA_NO_NODE.
>
> From below, if the __GFP_THISNODE is set, the fallback is not performed.

__GFP_THISNODE is a very specific situation when the caller knows which
node should be used for the allocation. NUMA_NO_NODE && __GFP_THISNODE
is a bug and I would dare to call it an undefined behavior.

> For node_to_cpumask_map() case, maybe we can return the cpumask that is
> on the node of cpu_to_node(raw_smp_processor_id()) for NUMA_NO_NODE,
> because the current cpu does belong to a node, and the node does have at
> least one cpu, which is the cpu is calling the node_to_cpumask_map().
>
> Make any sense?

No. Please read the above paragraph again. NUMA_NO_NODE really means no
node affinity. So all cpus should be usable. Making any assumptions
about a local context is just wrong.
--
Michal Hocko
SUSE Labs