CONFIG_HOLES_IN_ZONE and memory hot plug code on x86_64

From: Steffen Persvold
Date: Fri Jun 26 2015 - 19:31:55 EST


Hi,

We've encountered an issue in a special case where we have a sparse E820 map [1].

Basically, the memory hotplug code is causing a "kernel paging request" BUG [2].

By instrumenting the function register_mem_sect_under_node() in drivers/base/node.c, we see that it is called twice with the same struct memory_block argument:

[ 1.901463] register_mem_sect_under_node: start = 80, end = 8f, nid = 0
[ 1.908129] register_mem_sect_under_node: start = 80, end = 8f, nid = 1

The second call causes the paging request because the for loop in register_mem_sect_under_node() scans pfns:

for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {


and can't find one that matches the input "nid" argument (1), which is natural enough because those sections do not belong to node1, but rather to node0. As a result, the for loop runs into a "hole" in the pfn range which isn't mapped.

Now, the code appears to have been designed to handle this by checking whether the pfn really belongs to this node, using the function get_nid_for_pfn() in the same file:
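To make the failure mode concrete, here is a minimal userspace sketch of the scan. It is not the kernel code: struct model_pfn and scan_for_nid() are hypothetical names, and an unmapped memmap slot is modelled with a flag instead of a real page fault.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for one pfn's memmap slot. */
struct model_pfn {
	bool mapped;	/* false: no memmap backing, a real access would fault */
	int nid;	/* owning node, meaningful only when mapped */
};

/*
 * Walk [start, end] looking for a pfn owned by nid, in the same order as
 * the for loop quoted above. Returns the first matching pfn, or -1 if
 * there is none. *faulted_at is set to the first unmapped pfn the walk
 * had to touch, or to -1 if it never touched one.
 */
static long scan_for_nid(const struct model_pfn *map, long start, long end,
			 int nid, long *faulted_at)
{
	*faulted_at = -1;
	for (long pfn = start; pfn <= end; pfn++) {
		if (!map[pfn].mapped) {
			/* The kernel would oops here instead of returning. */
			*faulted_at = pfn;
			return -1;
		}
		if (map[pfn].nid == nid)
			return pfn;
	}
	return -1;
}
```

With a map whose early pfns belong to node 0 and are followed by an unmapped slot, a scan for nid 0 stops at the first pfn, while a scan for nid 1 keeps walking until it touches the unmapped slot. That is the asymmetry between the two calls described above.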

static int get_nid_for_pfn(unsigned long pfn)
{
	struct page *page;

	if (!pfn_valid_within(pfn))
		return -1;
	page = pfn_to_page(pfn);
	if (!page_initialized(page))
		return -1;
	return pfn_to_nid(pfn);
}

However, pfn_valid_within() (from include/linux/mmzone.h) never returns false here, because:

/*
 * If it is possible to have holes within a MAX_ORDER_NR_PAGES, then we
 * need to check pfn validility within that MAX_ORDER_NR_PAGES block.
 * pfn_valid_within() should be used in this case; we optimise this away
 * when we have no holes within a MAX_ORDER_NR_PAGES block.
 */
#ifdef CONFIG_HOLES_IN_ZONE
#define pfn_valid_within(pfn) pfn_valid(pfn)
#else
#define pfn_valid_within(pfn) (1)
#endif


CONFIG_HOLES_IN_ZONE cannot be set on x86_64; it is present only on ia64 and mips.

Is there a specific reason why CONFIG_HOLES_IN_ZONE isn't activated on x86_64? I've added a patch to arch/x86/Kconfig [3] which solves this issue. However, I guess another approach would be to figure out why register_mem_sect_under_node() is called with a wrong struct memory_block for node1...
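The effect of the missing option can be sketched in userspace. Everything prefixed model_ below is hypothetical, and MODEL_HOLES_IN_ZONE stands in for CONFIG_HOLES_IN_ZONE. The toy validity check treats pfns 8..15 as a hole.

```c
#include <stdbool.h>

/* Hypothetical validity check: pfns 8..15 form an unmapped hole. */
static bool model_pfn_valid(unsigned long pfn)
{
	return pfn < 8 || pfn > 15;
}

/* Same shape as the mmzone.h snippet above, with model names. */
#ifdef MODEL_HOLES_IN_ZONE
#define model_pfn_valid_within(pfn) model_pfn_valid(pfn)
#else
#define model_pfn_valid_within(pfn) (1)
#endif

/*
 * Mirrors the guard in get_nid_for_pfn(): returns -1 if the pfn is
 * rejected, otherwise 0 as a stand-in for reading the node id out of
 * struct page (the access that faults in the real kernel).
 */
static int model_get_nid_for_pfn(unsigned long pfn)
{
	if (!model_pfn_valid_within(pfn))
		return -1;
	return 0;
}
```

Built as is, model_get_nid_for_pfn(10) returns 0, so the hole goes unnoticed and the caller proceeds to the memmap access. Built with -DMODEL_HOLES_IN_ZONE, the same call returns -1 and the pfn is skipped.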

Any comments or suggestions are welcome.

PS: Even if we avoid the sparse e820 map, register_mem_sect_under_node() is still invoked twice with the same struct memory_block, once for node0 (which gets a match) and once for node1. However, when all the pfns are mapped, it walks the whole range without a paging request.

Cheers,
--
Steffen Persvold
Chief Architect NumaChip, Numascale AS
Tel: +47 23 16 71 88 Fax: +47 23 16 71 80 Skype: spersvold

[1]

[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000087fff] usable
[ 0.000000] BIOS-e820: [mem 0x0000000000088000-0x0000000000089bff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000089c00-0x000000000009ebff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009ec00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000e84e0-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000d7e5ffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000d7e6e000-0x00000000d7e6ffff] type 9
[ 0.000000] BIOS-e820: [mem 0x00000000d7e70000-0x00000000d7e93fff] ACPI data
[ 0.000000] BIOS-e820: [mem 0x00000000d7e94000-0x00000000d7ebffff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x00000000d7ec0000-0x00000000d7edffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000d7eed000-0x00000000d7ffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000ffe00000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x0000000407ffffff] usable
[ 0.000000] BIOS-e820: [mem 0x0000000408000000-0x0000000427ffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000428000000-0x0000000807ffffff] usable
[ 0.000000] BIOS-e820: [mem 0x0000000808000000-0x0000000827ffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000828000000-0x0000000c07ffffff] usable
[ 0.000000] BIOS-e820: [mem 0x0000000c08000000-0x0000000c27ffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000c28000000-0x0000001007ffffff] usable
[ 0.000000] BIOS-e820: [mem 0x0000001008000000-0x0000001027ffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000001028000000-0x0000001407ffffff] usable
[ 0.000000] BIOS-e820: [mem 0x0000001408000000-0x0000001427ffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000001428000000-0x0000001807ffffff] usable
[ 0.000000] BIOS-e820: [mem 0x0000001808000000-0x0000001827ffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000fd00000000-0x000000ffffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00003f0000000000-0x00003fffffffffff] reserved

[2]

[ 1.915002] BUG: unable to handle kernel paging request at ffffea0010200020
[ 1.922003] IP: [<ffffffff8176bc31>] register_mem_sect_under_node+0x91/0x100
[ 1.929074] PGD 407fdb067 PUD 407fda067 PMD 0
[ 1.933569] Oops: 0000 [#1] SMP
[ 1.936830] Modules linked in:
[ 1.939920] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.1.0-numascale51+ #11
[ 1.946971] Hardware name: Supermicro H8QGL/H8QGL, BIOS 3.5b 01/28/2015
[ 1.954104] task: ffff8803f7f10c40 ti: ffff8803f7f14000 task.ti: ffff8803f7f14000
[ 1.961590] RIP: 0010:[<ffffffff8176bc31>] [<ffffffff8176bc31>] register_mem_sect_under_node+0x91/0x100
[ 1.971089] RSP: 0018:ffff8803f7f17d78 EFLAGS: 00010206
[ 1.976404] RAX: ffffea0010200020 RBX: ffff8807f74a2000 RCX: 0000000000000000
[ 1.983541] RDX: 0000000000408000 RSI: 000000000047ffff RDI: ffffffff825cd968
[ 1.990680] RBP: ffff8803f7f17d98 R08: 000000000000000a R09: 0000000000000400
[ 1.997821] R10: 000000000000015a R11: ffff8803f7f17a78 R12: 0000000000000001
[ 2.004958] R13: 0000000000000001 R14: ffff8807f74a2000 R15: 0000000000000000
[ 2.012092] FS: 0000000000000000(0000) GS:ffff8803f8000000(0000) knlGS:0000000000000000
[ 2.020184] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2.025927] CR2: ffffea0010200020 CR3: 000000000220d000 CR4: 00000000000406b0
[ 2.033058] Stack:
[ 2.035095] ffff8803f7f17d98 0000000000428000 0000000000000001 0000000000000001
[ 2.042559] ffff8803f7f17de8 ffffffff8176c04a ffff8803f7f17da8 0000000000808000
[ 2.050019] ffff8803f80193c0 0000000000000002 0000000000000400 0000000000000000
[ 2.057481] Call Trace:
[ 2.059930] [<ffffffff8176c04a>] register_one_node+0x1ba/0x260
[ 2.065854] [<ffffffff8231e894>] ? enable_cpu0_hotplug+0x15/0x15
[ 2.071954] [<ffffffff8231e8d0>] topology_init+0x3c/0x95
[ 2.077358] [<ffffffff810002c4>] do_one_initcall+0x84/0x1b0
[ 2.083024] [<ffffffff810dcca3>] ? __wake_up+0x43/0x60
[ 2.088254] [<ffffffff8231810e>] kernel_init_freeable+0x166/0x1f1
[ 2.094440] [<ffffffff823178b8>] ? initcall_blacklist+0xad/0xad
[ 2.100453] [<ffffffff81b19500>] ? rest_init+0x80/0x80
[ 2.105683] [<ffffffff81b19509>] kernel_init+0x9/0xf0
[ 2.110828] [<ffffffff81b34d22>] ret_from_fork+0x42/0x70
[ 2.116233] [<ffffffff81b19500>] ? rest_init+0x80/0x80
[ 2.121464] Code: ff 7f 00 00 48 39 f2 77 be 48 c1 e0 15 48 b9 00 00 00 00 00 ea ff ff 48 8d 44 08 20 eb 0d 48 83 c2 01 48 83 c0 40 48 39 d6 72 9c <48> 83 38 00 74 ed 48 8b 48 e0 48 c1 e9 36 41 39 cc 75 e0 4a 8b
[ 2.141443] RIP [<ffffffff8176bc31>] register_mem_sect_under_node+0x91/0x100
[ 2.148616] RSP <ffff8803f7f17d78>
[ 2.152104] CR2: ffffea0010200020
[ 2.155423] ---[ end trace 74baf61bb679da4f ]---
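The numbers in [1] and [2] can be cross-checked with a little arithmetic. The sketch below is not taken from the kernel sources. It assumes 4 KiB pages, 128 MiB sparsemem sections, a vmemmap base of 0xffffea0000000000 and a 64-byte struct page, which are typical for x86_64 kernels of this era but are not stated in the mail. All MODEL_ constants and helper names are hypothetical.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT	12	/* assumed: 4 KiB pages */
#define MODEL_SECTION_SIZE_BITS	27	/* assumed: 128 MiB sections */
#define MODEL_PFN_SECTION_SHIFT	(MODEL_SECTION_SIZE_BITS - MODEL_PAGE_SHIFT)
#define MODEL_VMEMMAP_START	0xffffea0000000000ULL	/* assumed base */
#define MODEL_STRUCT_PAGE_SIZE	64ULL	/* assumed sizeof(struct page) */

/* Physical address -> page frame number. */
static uint64_t model_phys_to_pfn(uint64_t phys)
{
	return phys >> MODEL_PAGE_SHIFT;
}

/* Page frame number -> sparsemem section number. */
static uint64_t model_pfn_to_section(uint64_t pfn)
{
	return pfn >> MODEL_PFN_SECTION_SHIFT;
}

/* First and last pfn covered by a section. */
static uint64_t model_section_start_pfn(uint64_t sec)
{
	return sec << MODEL_PFN_SECTION_SHIFT;
}

static uint64_t model_section_end_pfn(uint64_t sec)
{
	return ((sec + 1) << MODEL_PFN_SECTION_SHIFT) - 1;
}

/* Address inside the vmemmap array -> pfn whose struct page holds it. */
static uint64_t model_vmemmap_addr_to_pfn(uint64_t vaddr)
{
	return (vaddr - MODEL_VMEMMAP_START) / MODEL_STRUCT_PAGE_SIZE;
}
```

Under those assumptions, sections 0x80..0x8f cover pfns 0x400000..0x47ffff, which matches RSI in the oops. The faulting address ffffea0010200020 falls inside the struct page of pfn 0x408000, which matches RDX and is the first pfn of the reserved range starting at 0x408000000 in [1]. That pfn lies in section 0x81, inside the block passed to register_mem_sect_under_node().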

[3]
---
arch/x86/Kconfig | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 226d5696..753e42b6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1403,6 +1403,10 @@ config ARCH_SELECT_MEMORY_MODEL
 	def_bool y
 	depends on ARCH_SPARSEMEM_ENABLE
 
+config HOLES_IN_ZONE
+	bool
+	default y if ARCH_SPARSEMEM_DEFAULT
+
 config ARCH_MEMORY_PROBE
 	bool "Enable sysfs memory/probe interface"
 	depends on X86_64 && MEMORY_HOTPLUG
--