Re: [PATCH] arm64: numa: check the node id before accessing node_to_cpumask_map

From: Yunsheng Lin
Date: Fri Aug 30 2019 - 05:50:18 EST


On 2019/8/30 16:39, Michal Hocko wrote:
> On Fri 30-08-19 16:08:14, Yunsheng Lin wrote:
>> On 2019/8/30 14:44, Michal Hocko wrote:
>>> On Fri 30-08-19 14:35:26, Yunsheng Lin wrote:
>>>> On 2019/8/30 13:55, Michal Hocko wrote:
>>>>> On Fri 30-08-19 10:26:31, Yunsheng Lin wrote:
>>>>>> Some buggy bios may not set the device' numa id, and dev_to_node
>>>>>> will return -1, which may cause global-out-of-bounds error
>>>>>> detected by KASAN.
>>>>>
>>>>> Why should we workaround a buggy bios like that? Is it so widespread and
>>>>> no BIOS update available? Also, why is this arm64 specific?
>>>>
>>>> For our case, there is BIOS update available. I just thought it might
>>>> be better to protect from this case when BIOS has not implemented the
>>>> device' numa id setting feature or the feature from BIOS has some bug.
>>>>
>>>> It is not arm64 specific, right now I only have arm64 board. If it is
>>>> ok to protect this from the buggy BIOS, maybe all other arch can be
>>>> changed too.
>>>
>>> If we are to really care then this should be consistent among
>>> architectures IMHO. But I am not really sure this is really worth it.
>>> The code is quite old and I do not really remember any reports.
>>
>> It is only detected by enabling KASAN, the system seems to run fine without
>> any visible error if KASAN is disabled. Maybe there is why no report has
>> been seen?
>>
>> Also according to Section 6.2.14 from ACPI spec 6.3 [1], the setting of proximity
>> domain is optional, as below:
>>
>> This optional object is used to describe proximity domain
>> associations within a machine. _PXM evaluates to an integer
>> that identifies a device as belonging to a Proximity Domain
>> defined in the System Resource Affinity Table (SRAT).
>>
>>
>> Do you think it is ok to resend the fix with above clarification and below
>> log:
>
> OK, if the specification really allows to provide NUMA_NO_NODE (-1) then
> the code really has to be prepared for that. And ideally all arches
> should deal with that.
>
>> [ 42.970381] ==================================================================
>> [ 42.977595] BUG: KASAN: global-out-of-bounds in __bitmap_weight+0x48/0xb0
>> [ 42.984370] Read of size 8 at addr ffff20008cdf8790 by task kworker/0:1/13
>> [ 42.991230]
>> [ 42.992712] CPU: 0 PID: 13 Comm: kworker/0:1 Tainted: G O 5.2.0-rc4-g8bde06a-dirty #3
>> [ 43.001830] Hardware name: Huawei TaiShan 2280 V2/BC82AMDA, BIOS TA BIOS 2280-A CS V2.B050.01 08/08/2019
>> [ 43.011298] Workqueue: events work_for_cpu_fn
>> [ 43.015643] Call trace:
>> [ 43.018078] dump_backtrace+0x0/0x1e8
>> [ 43.021727] show_stack+0x14/0x20
>> [ 43.025031] dump_stack+0xc4/0xfc
>> [ 43.028335] print_address_description+0x178/0x270
>> [ 43.033113] __kasan_report+0x164/0x1b8
>> [ 43.036936] kasan_report+0xc/0x18
>> [ 43.040325] __asan_load8+0x84/0xa8
>> [ 43.043801] __bitmap_weight+0x48/0xb0
>> [ 43.047552] hclge_init_ae_dev+0x988/0x1e78 [hclge]
>> [ 43.052418] hnae3_register_ae_dev+0xcc/0x278 [hnae3]
>> [ 43.057467] hns3_probe+0xe0/0x120 [hns3]
>> [ 43.061464] local_pci_probe+0x74/0xf0
>> [ 43.065200] work_for_cpu_fn+0x2c/0x48
>> [ 43.068937] process_one_work+0x3c0/0x878
>> [ 43.072934] worker_thread+0x400/0x670
>> [ 43.076670] kthread+0x1b0/0x1b8
>> [ 43.079885] ret_from_fork+0x10/0x18
>> [ 43.083446]
>> [ 43.084925] The buggy address belongs to the variable:
>> [ 43.090052] numa_distance+0x30/0x40
>> [ 43.093613]
>> [ 43.095091] Memory state around the buggy address:
>> [ 43.099870] ffff20008cdf8680: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 fa fa
>> [ 43.107078] ffff20008cdf8700: fa fa fa fa 04 fa fa fa fa fa fa fa 00 fa fa fa
>> [ 43.114286] >ffff20008cdf8780: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
>> [ 43.121494] ^
>> [ 43.125230] ffff20008cdf8800: 01 fa fa fa fa fa fa fa 04 fa fa fa fa fa fa fa
>> [ 43.132439] ffff20008cdf8880: fa fa fa fa fa fa fa fa 00 00 fa fa fa fa fa fa
>> [ 43.139646] ==================================================================
>>
>>
>>>
>>>>>> This patch changes cpumask_of_node to return cpu_none_mask if the
>>>>>> node is not valid, and sync the cpumask_of_node between the
>>>>>> cpumask_of_node function in numa.h and numa.c.
>>>>>
>>>>> Why?
>>>>
>>>> When CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node() in
>>>> numa.c is used, if not, the cpumask_of_node() in numa.h is used.
>>>>
>>>> I am not sure why there is difference between them, and it is there
>>>> when since the below commit:
>>>> 1a2db300348b ("arm64, numa: Add NUMA support for arm64 platforms.")
>>>>
>>>> I synced them to keep them consistent whether CONFIG_DEBUG_PER_CPU_MAPS
>>>> is defined.
>>>
>>> Such a change should be made in a separate patch with a full
>>> clarification/justification. From the above it is still not clear why
>>> this is needed though.
>>
>> Ok.
>>
>> How about:
>>
>> Currently there are different implementations of cpumask_of_node() depend
>> on the arch, for example:
>>
>> ia64:
>> #define cpumask_of_node(node) ((node) == -1 ? \
>> cpu_all_mask : \
>> &node_to_cpu_mask[node])
>>
>>
>> alpha:
>> static const struct cpumask *cpumask_of_node(int node)
>> {
>> int cpu;
>>
>> if (node == NUMA_NO_NODE)
>> return cpu_all_mask;
>>
>> cpumask_clear(&node_to_cpumask_map[node]);
>>
>> for_each_online_cpu(cpu) {
>> if (cpu_to_node(cpu) == node)
>> cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
>> }
>>
>> return &node_to_cpumask_map[node];
>> }
>>
>> Even for the same arch, there are two implementations depend on the
>> CONFIG_DEBUG_PER_CPU_MAPS configuration.
>>
>> arm64/x86 without CONFIG_DEBUG_PER_CPU_MAPS:
>> static inline const struct cpumask *cpumask_of_node(int node)
>> {
>> return node_to_cpumask_map[node];
>> }
>>
>> arm64/x86 with CONFIG_DEBUG_PER_CPU_MAPS:
>> const struct cpumask *cpumask_of_node(int node)
>> {
>> if (WARN_ON(node >= nr_node_ids))
>> return cpu_none_mask;
>>
>> if (WARN_ON(node_to_cpumask_map[node] == NULL))
>> return cpu_online_mask;
>>
>> return node_to_cpumask_map[node];
>> }
>>
>> It seems the cpumask_of_node with CONFIG_DEBUG_PER_CPU_MAPS is used
>> to catch the erorr case and give a warning to user when node id is not
>> valid.
>
> Yeah the config help text
> config DEBUG_PER_CPU_MAPS
> bool "Debug access to per_cpu maps"
> depends on DEBUG_KERNEL
> depends on SMP
> help
> Say Y to verify that the per_cpu map being accessed has
> been set up. This adds a fair amount of code to kernel memory
> and decreases performance.
>
> Say N if unsure.
>
> suggests that this is intentionally hidden behind a config so a normal
> path shouldn't really duplicate it. If those checks make sense in
> general then the config option should be dropped I think.

It seems cpumask_of_node with CONFIG_DEBUG_PER_CPU_MAPS on may be used to
debug some early use of cpumask_of_node problem, see below:

/*
* Allocate node_to_cpumask_map based on number of available nodes
* Requires node_possible_map to be valid.
*
* Note: cpumask_of_node() is not valid until after this is done.
* (Use CONFIG_DEBUG_PER_CPU_MAPS to check this.)
*/
static void __init setup_node_to_cpumask_map(void)
{
int node;

/* setup nr_node_ids if not done yet */
if (nr_node_ids == MAX_NUMNODES)
setup_nr_node_ids();

/* allocate and clear the mapping */
for (node = 0; node < nr_node_ids; node++) {
alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);
cpumask_clear(node_to_cpumask_map[node]);
}

/* cpumask_of_node() will now work */
pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
}

So I prefer to keep it as two implementations for arm64 and x86, but
ensure the two implementations be consistent. It can be cleaned up later
when there is no use at all.

Is it ok with you?

>