Re: [PATCH] arm64: numa: check the node id before accessing node_to_cpumask_map

From: Michal Hocko
Date: Fri Aug 30 2019 - 04:39:31 EST


On Fri 30-08-19 16:08:14, Yunsheng Lin wrote:
> On 2019/8/30 14:44, Michal Hocko wrote:
> > On Fri 30-08-19 14:35:26, Yunsheng Lin wrote:
> >> On 2019/8/30 13:55, Michal Hocko wrote:
> >>> On Fri 30-08-19 10:26:31, Yunsheng Lin wrote:
> >>>> Some buggy bios may not set the device' numa id, and dev_to_node
> >>>> will return -1, which may cause global-out-of-bounds error
> >>>> detected by KASAN.
> >>>
> >>> Why should we workaround a buggy bios like that? Is it so widespread and
> >>> no BIOS update available? Also, why is this arm64 specific?
> >>
> >> For our case, there is BIOS update available. I just thought it might
> >> be better to protect from this case when BIOS has not implemented the
> >> device' numa id setting feature or the feature from BIOS has some bug.
> >>
> >> It is not arm64 specific, right now I only have arm64 board. If it is
> >> ok to protect this from the buggy BIOS, maybe all other arch can be
> >> changed too.
> >
> > If we are to really care then this should be consistent among
> > architectures IMHO. But I am not really sure this is really worth it.
> > The code is quite old and I do not really remember any reports.
>
> It is only detected by enabling KASAN, the system seems to run fine without
> any visible error if KASAN is disabled. Maybe there is why no report has
> been seen?
>
> Also according to Section 6.2.14 from ACPI spec 6.3 [1], the setting of proximity
> domain is optional, as below:
>
> This optional object is used to describe proximity domain
> associations within a machine. _PXM evaluates to an integer
> that identifies a device as belonging to a Proximity Domain
> defined in the System Resource Affinity Table (SRAT).
>
>
> Do you think it is ok to resend the fix with above clarification and below
> log:

OK, if the specification really allows to provide NUMA_NO_NODE (-1) then
the code really has to be prepared for that. And ideally all arches
should deal with that.

> [ 42.970381] ==================================================================
> [ 42.977595] BUG: KASAN: global-out-of-bounds in __bitmap_weight+0x48/0xb0
> [ 42.984370] Read of size 8 at addr ffff20008cdf8790 by task kworker/0:1/13
> [ 42.991230]
> [ 42.992712] CPU: 0 PID: 13 Comm: kworker/0:1 Tainted: G O 5.2.0-rc4-g8bde06a-dirty #3
> [ 43.001830] Hardware name: Huawei TaiShan 2280 V2/BC82AMDA, BIOS TA BIOS 2280-A CS V2.B050.01 08/08/2019
> [ 43.011298] Workqueue: events work_for_cpu_fn
> [ 43.015643] Call trace:
> [ 43.018078] dump_backtrace+0x0/0x1e8
> [ 43.021727] show_stack+0x14/0x20
> [ 43.025031] dump_stack+0xc4/0xfc
> [ 43.028335] print_address_description+0x178/0x270
> [ 43.033113] __kasan_report+0x164/0x1b8
> [ 43.036936] kasan_report+0xc/0x18
> [ 43.040325] __asan_load8+0x84/0xa8
> [ 43.043801] __bitmap_weight+0x48/0xb0
> [ 43.047552] hclge_init_ae_dev+0x988/0x1e78 [hclge]
> [ 43.052418] hnae3_register_ae_dev+0xcc/0x278 [hnae3]
> [ 43.057467] hns3_probe+0xe0/0x120 [hns3]
> [ 43.061464] local_pci_probe+0x74/0xf0
> [ 43.065200] work_for_cpu_fn+0x2c/0x48
> [ 43.068937] process_one_work+0x3c0/0x878
> [ 43.072934] worker_thread+0x400/0x670
> [ 43.076670] kthread+0x1b0/0x1b8
> [ 43.079885] ret_from_fork+0x10/0x18
> [ 43.083446]
> [ 43.084925] The buggy address belongs to the variable:
> [ 43.090052] numa_distance+0x30/0x40
> [ 43.093613]
> [ 43.095091] Memory state around the buggy address:
> [ 43.099870] ffff20008cdf8680: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 fa fa
> [ 43.107078] ffff20008cdf8700: fa fa fa fa 04 fa fa fa fa fa fa fa 00 fa fa fa
> [ 43.114286] >ffff20008cdf8780: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
> [ 43.121494] ^
> [ 43.125230] ffff20008cdf8800: 01 fa fa fa fa fa fa fa 04 fa fa fa fa fa fa fa
> [ 43.132439] ffff20008cdf8880: fa fa fa fa fa fa fa fa 00 00 fa fa fa fa fa fa
> [ 43.139646] ==================================================================
>
>
> >
> >>>> This patch changes cpumask_of_node to return cpu_none_mask if the
> >>>> node is not valid, and sync the cpumask_of_node between the
> >>>> cpumask_of_node function in numa.h and numa.c.
> >>>
> >>> Why?
> >>
> >> When CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node() in
> >> numa.c is used, if not, the cpumask_of_node() in numa.h is used.
> >>
> >> I am not sure why there is difference between them, and it is there
> >> when since the below commit:
> >> 1a2db300348b ("arm64, numa: Add NUMA support for arm64 platforms.")
> >>
> >> I synced them to keep them consistent whether CONFIG_DEBUG_PER_CPU_MAPS
> >> is defined.
> >
> > Such a change should be made in a separate patch with a full
> > clarification/justification. From the above it is still not clear why
> > this is needed though.
>
> Ok.
>
> How about:
>
> Currently there are different implementations of cpumask_of_node() depend
> on the arch, for example:
>
> ia64:
> #define cpumask_of_node(node) ((node) == -1 ? \
> cpu_all_mask : \
> &node_to_cpu_mask[node])
>
>
> alpha:
> static const struct cpumask *cpumask_of_node(int node)
> {
> int cpu;
>
> if (node == NUMA_NO_NODE)
> return cpu_all_mask;
>
> cpumask_clear(&node_to_cpumask_map[node]);
>
> for_each_online_cpu(cpu) {
> if (cpu_to_node(cpu) == node)
> cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
> }
>
> return &node_to_cpumask_map[node];
> }
>
> Even for the same arch, there are two implementations depend on the
> CONFIG_DEBUG_PER_CPU_MAPS configuration.
>
> arm64/x86 without CONFIG_DEBUG_PER_CPU_MAPS:
> static inline const struct cpumask *cpumask_of_node(int node)
> {
> return node_to_cpumask_map[node];
> }
>
> arm64/x86 with CONFIG_DEBUG_PER_CPU_MAPS:
> const struct cpumask *cpumask_of_node(int node)
> {
> if (WARN_ON(node >= nr_node_ids))
> return cpu_none_mask;
>
> if (WARN_ON(node_to_cpumask_map[node] == NULL))
> return cpu_online_mask;
>
> return node_to_cpumask_map[node];
> }
>
> It seems the cpumask_of_node with CONFIG_DEBUG_PER_CPU_MAPS is used
> to catch the erorr case and give a warning to user when node id is not
> valid.

Yeah the config help text
config DEBUG_PER_CPU_MAPS
bool "Debug access to per_cpu maps"
depends on DEBUG_KERNEL
depends on SMP
help
Say Y to verify that the per_cpu map being accessed has
been set up. This adds a fair amount of code to kernel memory
and decreases performance.

Say N if unsure.

suggests that this is intentionally hidden behind a config so a normal
path shouldn't really duplicate it. If those checks make sense in
general then the config option should be dropped I think.
--
Michal Hocko
SUSE Labs