[PATCH 2/2] x86/numa: instantiate all parsed NUMA nodes

From: Pingfan Liu
Date: Fri Jul 05 2019 - 00:16:48 EST


I hit a bug on an AMD machine when using kexec -l with the nr_cpus=4 option. The
nr_cpus option is used to speed up the kdump process, so this is not a rare case.

It turns out that some pgdat structures are not instantiated when nr_cpus is
specified; on x86, for example, they are not initialized by
init_cpu_to_node()->init_memory_less_node(). But device->numa_node is used as
the preferred_nid parameter for __alloc_pages_nodemask(), which causes a NULL
dereference in ac->zonelist = node_zonelist(preferred_nid, gfp_mask); a sketch
of the failing dereference follows.
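For reference, a minimal sketch of that dereference, following the mainline
definition of node_zonelist() in include/linux/gfp.h (simplified here for
illustration only):

/* simplified from include/linux/gfp.h */
static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	/*
	 * NODE_DATA(nid) is the node's pgdat. For a node whose pgdat was
	 * never instantiated it is NULL, so the returned zonelist pointer
	 * is just the offset of node_zonelists within pg_data_t, and the
	 * subsequent access faults (CR2 = 0x2088 in the oops in section I).
	 */
	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}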

Although this bug was detected on x86, it should affect all architectures: on
a machine with a memory-less NUMA node, if nr_cpus prevents that node from
being instantiated and a device on the node then allocates memory based on its
device->numa_node info, the same crash occurs (one such path, taken from the
stack trace in section I, is sketched below).
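A rough sketch of that path, condensed from the devm_kmalloc() call chain in
the trace; alloc_on_dev_node() is only an illustrative name, not a real kernel
function:

/* illustrative only: condensed devm_kmalloc() -> kmalloc_node() path */
static void *alloc_on_dev_node(struct device *dev, size_t size)
{
	int nid = dev_to_node(dev);	/* dev->numa_node */

	/*
	 * With nr_cpus=, nid may name a node that was parsed from the
	 * firmware tables but never instantiated, so the page allocator
	 * ends up dereferencing a NULL NODE_DATA(nid) via node_zonelist().
	 */
	return kmalloc_node(size, GFP_KERNEL, nid);
}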

This patch fixes the problem by instantiating all parsed NUMA nodes on x86
(for more detail, please refer to sections I and II below).

I. Notes about the crashing info:
-1 kexec -l with nr_cpus=4
-2 system info
NUMA node0 CPU(s): 0,8,16,24
NUMA node1 CPU(s): 2,10,18,26
NUMA node2 CPU(s): 4,12,20,28
NUMA node3 CPU(s): 6,14,22,30
NUMA node4 CPU(s): 1,9,17,25
NUMA node5 CPU(s): 3,11,19,27
NUMA node6 CPU(s): 5,13,21,29
NUMA node7 CPU(s): 7,15,23,31
-3 panic stack
[...]
[ 5.721547] atomic64_test: passed for x86-64 platform with CX8 and with SSE
[ 5.729187] pcieport 0000:00:01.1: Signaling PME with IRQ 34
[ 5.735187] pcieport 0000:00:01.2: Signaling PME with IRQ 35
[ 5.741168] pcieport 0000:00:01.3: Signaling PME with IRQ 36
[ 5.747189] pcieport 0000:00:07.1: Signaling PME with IRQ 37
[ 5.754061] pcieport 0000:00:08.1: Signaling PME with IRQ 39
[ 5.760727] pcieport 0000:20:07.1: Signaling PME with IRQ 40
[ 5.766955] pcieport 0000:20:08.1: Signaling PME with IRQ 42
[ 5.772742] BUG: unable to handle kernel paging request at 0000000000002088
[ 5.773618] PGD 0 P4D 0
[ 5.773618] Oops: 0000 [#1] SMP NOPTI
[ 5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
[ 5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[ 5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[ 5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89
e1 44 89 e6 89
[ 5.773618] RSP: 0018:ffffaa600005fb20 EFLAGS: 00010246
[ 5.773618] RAX: 0000000000000000 RBX: 00000000006012c0 RCX: 0000000000000000
[ 5.773618] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[ 5.773618] RBP: 00000000006012c0 R08: 0000000000000000 R09: 0000000000000002
[ 5.773618] R10: 00000000006080c0 R11: 0000000000000002 R12: 0000000000000000
[ 5.773618] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000002
[ 5.773618] FS: 0000000000000000(0000) GS:ffff8c69afe00000(0000) knlGS:0000000000000000
[ 5.773618] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.773618] CR2: 0000000000002088 CR3: 000000087e00a000 CR4: 00000000003406e0
[ 5.773618] Call Trace:
[ 5.773618] new_slab+0xa9/0x570
[ 5.773618] ___slab_alloc+0x375/0x540
[ 5.773618] ? pinctrl_bind_pins+0x2b/0x2a0
[ 5.773618] __slab_alloc+0x1c/0x38
[ 5.773618] __kmalloc_node_track_caller+0xc8/0x270
[ 5.773618] ? pinctrl_bind_pins+0x2b/0x2a0
[ 5.773618] devm_kmalloc+0x28/0x60
[ 5.773618] pinctrl_bind_pins+0x2b/0x2a0
[ 5.773618] really_probe+0x73/0x420
[ 5.773618] driver_probe_device+0x115/0x130
[ 5.773618] __driver_attach+0x103/0x110
[ 5.773618] ? driver_probe_device+0x130/0x130
[ 5.773618] bus_for_each_dev+0x67/0xc0
[ 5.773618] ? klist_add_tail+0x3b/0x70
[ 5.773618] bus_add_driver+0x41/0x260
[ 5.773618] ? pcie_port_setup+0x4d/0x4d
[ 5.773618] driver_register+0x5b/0xe0
[ 5.773618] ? pcie_port_setup+0x4d/0x4d
[ 5.773618] do_one_initcall+0x4e/0x1d4
[ 5.773618] ? init_setup+0x25/0x28
[ 5.773618] kernel_init_freeable+0x1c1/0x26e
[ 5.773618] ? loglevel+0x5b/0x5b
[ 5.773618] ? rest_init+0xb0/0xb0
[ 5.773618] kernel_init+0xa/0x110
[ 5.773618] ret_from_fork+0x22/0x40
[ 5.773618] Modules linked in:
[ 5.773618] CR2: 0000000000002088
[ 5.773618] ---[ end trace 1030c9120a03d081 ]---
[...]

-4 other notes about the reproduction of this bug:
On my test machine, this bug is masked by commit 0d76bcc960e6 ("Revert
"ACPI/PCI: Pay attention to device-specific _PXM node values""), but the crash
caused by dev->numa_node is still exposed via other paths.

II. History

My original attempt [1] took the approach of deferring the instantiation of
offline nodes.

Later, Michal suggested a fix [2] which only considers nodes with memory as
online. Besides fixing this bug, that patch also aimed at excluding memory-less
nodes as candidates when iterating the zones. Unfortunately, that method
conflicts with the scheduler code, which assumes that nodes with CPUs are
online too. You can find the breakage with "git grep for_each_online_node |
grep sched" or in the discussion at the tail of [3]; a hypothetical
illustration follows.
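To illustrate the conflict (hypothetical code, not quoted from the scheduler
sources; setup_per_node_sched_data() is a stand-in for the real per-node
initialization):

	for_each_online_node(nid) {
		/*
		 * If "online" only meant "has memory", a memory-less node
		 * that still has CPUs would be skipped here, leaving its
		 * per-node scheduling data uninitialized.
		 */
		setup_per_node_sched_data(nid);
	}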

Since Michal has no time to continue on this issue, I have picked it up again.
This patch drops the change to the "node online" definition made in [2], i.e. a
node is still considered online if it has either CPUs or memory, and keeps the
main idea of [2]: initializing all parsed nodes on x86. Other architectures
will need their own dedicated effort.

[1]: https://patchwork.kernel.org/patch/10738733/
[2]: https://lkml.org/lkml/2019/2/13/253
[3]: https://lore.kernel.org/lkml/20190528182011.GG1658@xxxxxxxxxxxxxx/T/

Signed-off-by: Pingfan Liu <kernelfans@xxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Mike Rapoport <rppt@xxxxxxxxxxxxx>
Cc: Tony Luck <tony.luck@xxxxxxxxx>
Cc: Andy Lutomirski <luto@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxxxx>
Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Oscar Salvador <osalvador@xxxxxxx>
Cc: Pavel Tatashin <pavel.tatashin@xxxxxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Cc: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
Cc: Michael Ellerman <mpe@xxxxxxxxxxxxxx>
Cc: Stephen Rothwell <sfr@xxxxxxxxxxxxxxxx>
Cc: Qian Cai <cai@xxxxxx>
Cc: Barret Rhoden <brho@xxxxxxxxxx>
Cc: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Cc: linux-mm@xxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx
---
arch/x86/mm/numa.c | 17 ++++++++++++-----
mm/page_alloc.c | 11 ++++++++---
2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index b48d507..5f5b558 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -732,6 +732,15 @@ static void __init init_memory_less_node(int nid)
*/
}

+static void __init init_parsed_rest_node(void)
+{
+ int node;
+
+ for_each_node_mask(node, node_possible_map)
+ if (!node_online(node))
+ init_memory_less_node(node);
+}
+
/*
* Setup early cpu_to_node.
*
@@ -752,6 +761,7 @@ void __init init_cpu_to_node(void)
u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);

BUG_ON(cpu_to_apicid == NULL);
+ init_parsed_rest_node();

for_each_possible_cpu(cpu) {
int node = numa_cpu_node(cpu);
@@ -759,11 +769,8 @@ void __init init_cpu_to_node(void)
if (node == NUMA_NO_NODE)
continue;

- if (!node_online(node)) {
- init_memory_less_node(node);
- node_set_online(nid);
- }
-
+ if (!node_online(node))
+ node_set_online(node);
numa_set_node(cpu, node);
}
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d66bc8a..5d8db00 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5662,10 +5662,15 @@ static void __build_all_zonelists(void *data)
if (self && !node_online(self->node_id)) {
build_zonelists(self);
} else {
- for_each_online_node(nid) {
+ /* In rare cases, node_zonelist() can hit an offline node */
+ for_each_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
-
- build_zonelists(pgdat);
+ /*
+ * This check can be removed on archs where all
+ * possible nodes are instantiated.
+ */
+ if (pgdat)
+ build_zonelists(pgdat);
}

#ifdef CONFIG_HAVE_MEMORYLESS_NODES
--
2.7.5