Subject: [PATCH] x86, ACPI, mm, numa: Fix problems caused by movablemem_map

Tim found:

[    0.181441] WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
[    0.181443] Hardware name: S2600CP
[    0.181445] sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[    0.166925] smpboot: Booting Node   1, Processors  #1
[    0.181446] Modules linked in:
[    0.181451] Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
[    0.181452] Call Trace:
[    0.181457]  [] ? topology_sane.isra.2+0x6f/0x80
[    0.181463]  [] warn_slowpath_common+0x7f/0xc0
[    0.181469]  [] warn_slowpath_fmt+0x4c/0x50
[    0.181473]  [] ? mcheck_cpu_init+0x378/0x3fb
[    0.181478]  [] ? cpuid_eax+0x27/0x2e
[    0.181483]  [] topology_sane.isra.2+0x6f/0x80
[    0.181488]  [] set_cpu_sibling_map+0x279/0x449
[    0.181493]  [] start_secondary+0x11d/0x1e5
[    0.181507] ---[ end trace 8c24ebb220b8c665 ]---

Don Morris reproduced it on an HP z620 workstation, and bisected it to:

  # bad: [e8d1955258091e4c92d5a975ebd7fd8a98f5d30f] acpi, memory-hotplug: parse SRAT before memblock is ready

It turns out movablemem_map has several problems, and it breaks several
things:

1. numa_init() is called several times, not just for SRAT, so

       nodes_clear(numa_nodes_parsed)
       memset(&numa_meminfo, 0, sizeof(numa_meminfo))

   can not simply be removed.  The sequence numaq, srat, amd, dummy
   needs to be considered, and the fallback path has to keep working.

2. Simply splitting acpi_numa_init() into early_parse_srat() is wrong:
   a. early_parse_srat() is not called on ia64, so ia64 is broken.
   b. The loop

          for (i = 0; i < MAX_LOCAL_APIC; i++)
              set_apicid_to_node(i, NUMA_NO_NODE)

      is still left in numa_init(), so it just clears the result from
      early_parse_srat().  It should be moved before that call.
   c. It breaks ACPI_TABLE_OVERRIDE, as the ACPI table scan is moved
      early, before the table override from the initrd is settled.

3. The patch title is totally misleading: there is no "x86" in the
   title, but it changes critical x86 code.  That caused the x86
   maintainers to not pay attention and find the problem early.
Those patches really should have been routed via tip/x86/mm.

4. After that commit, the following ranges can not use movable RAM:
   a. real mode code: funny, could legacy Node0 [0,1M) really be
      hot-removed?
   b. initrd: it is freed after booting, so it could live in movable
      RAM.
   c. crashkernel for kdump: it looks like we can not put the kdump
      kernel above 4G any more.
   d. init_mem_mapping: page tables can not be put high any more.
   e. initmem_init: vmemmap can not be high on the local node any more.

That is not good.  If a node is hotpluggable, memory-related ranges
like page tables and vmemmap could be on that node without problems,
and should be on that node.

This patch is a kind of refreshment of

| commit 1411e0ec3123ae4c4ead6bfc9fe3ee5a3ae5c327
| Date:   Mon Dec 27 16:48:17 2010 -0800
|
|     x86-64, numa: Put pgtable to local node memory

which was reverted before.  Now there is a reason to introduce it
again, to make memory hotplug work:

1. init_mem_mapping() only maps up to max_low_pfn at first, and
   relocates the initrd, so acpi_override keeps working.
2. Then the ACPI tables, including SRAT, are parsed.
3. The mapping for [max_low_pfn, max_pfn) is set up during
   initmem_init(), so page tables are put on the local node.  This
   needs alloc_low_pages() to be reworked to allocate page tables in
   the following order: BRK, local node, low range.
4. On 64-bit, kexec or kdump could load the initrd far above 4G, and
   that initrd could be relocated down below max_low_pfn by the second
   kernel.  That is a problem for kdump, which would then not have
   enough RAM below 4G.  To fix that, max_low_pfn can be adjusted
   above 4G on 64-bit platforms.  That limits the initrd size to less
   than 2G, which should be OK.
5. movablemem_map is only enabled after vmemmap is set up, so vmemmap
   stays on the local node.

Needs to be applied on top of "x86, ACPI, mm: Kill max_low_pfn_mapped".

-v2: add missing changes to setup.c during patch splitting.
Reported-by: Tim Gardner
Reported-by: Don Morris
Bisected-by: Don Morris
Signed-off-by: Yinghai Lu
Cc: Tejun Heo
---
 arch/x86/include/asm/e820.h    |    1 +
 arch/x86/include/asm/pgtable.h |    2 -
 arch/x86/kernel/e820.c         |   25 ++++++++++++
 arch/x86/kernel/setup.c        |   62 ++++++++++++++++++++++---------
 arch/x86/mm/init.c             |   82 ++++++++++++++++++++++-----------------
 arch/x86/mm/init_64.c          |    1 +
 arch/x86/mm/numa.c             |   26 ++++++++++---
 drivers/acpi/numa.c            |   22 +++-------
 include/linux/acpi.h           |    8 ----
 include/linux/mm.h             |    1 +
 mm/memblock.c                  |   13 +++---
 mm/page_alloc.c                |    3 +
 12 files changed, 157 insertions(+), 89 deletions(-)

Index: linux-2.6/arch/x86/include/asm/pgtable.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pgtable.h
+++ linux-2.6/arch/x86/include/asm/pgtable.h
@@ -621,7 +621,7 @@ static inline int pgd_none(pgd_t pgd)
 #ifndef __ASSEMBLY__
 
 extern int direct_gbpages;
-void init_mem_mapping(void);
+void init_mem_mapping(unsigned long begin, unsigned long end);
 void early_alloc_pgt_buf(void);
 
 /* local pte updates need not use xchg for locking */
Index: linux-2.6/arch/x86/mm/init.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init.c
+++ linux-2.6/arch/x86/mm/init.c
@@ -24,7 +24,10 @@ static unsigned long __initdata pgt_buf_
 static unsigned long __initdata pgt_buf_end;
 static unsigned long __initdata pgt_buf_top;
 
-static unsigned long min_pfn_mapped;
+static unsigned long low_min_pfn_mapped;
+static unsigned long low_max_pfn_mapped;
+static unsigned long local_min_pfn_mapped;
+static unsigned long local_max_pfn_mapped;
 
 static bool __initdata can_use_brk_pgt = true;
@@ -52,10 +55,17 @@ __ref void *alloc_low_pages(unsigned int
 	if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
 		unsigned long ret;
 
-		if (min_pfn_mapped >= max_pfn_mapped)
-			panic("alloc_low_page: ran out of memory");
-		ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
-					max_pfn_mapped << PAGE_SHIFT,
+		if (local_min_pfn_mapped >= local_max_pfn_mapped) {
+			if (low_min_pfn_mapped >= low_max_pfn_mapped)
+				panic("alloc_low_page: ran out of memory");
+			ret = memblock_find_in_range(
+					low_min_pfn_mapped << PAGE_SHIFT,
+					low_max_pfn_mapped << PAGE_SHIFT,
+					PAGE_SIZE * num , PAGE_SIZE);
+		} else
+			ret = memblock_find_in_range(
+					local_min_pfn_mapped << PAGE_SHIFT,
+					local_max_pfn_mapped << PAGE_SHIFT,
 					PAGE_SIZE * num , PAGE_SIZE);
 		if (!ret)
 			panic("alloc_low_page: can not alloc memory");
@@ -387,68 +397,72 @@ static unsigned long __init init_range_m
 /* (PUD_SHIFT-PMD_SHIFT)/2 */
 #define STEP_SIZE_SHIFT 5
-void __init init_mem_mapping(void)
+void __init init_mem_mapping(unsigned long begin, unsigned long end)
 {
-	unsigned long end, real_end, start, last_start;
+	unsigned long real_end, start, last_start;
 	unsigned long step_size;
 	unsigned long addr;
 	unsigned long mapped_ram_size = 0;
 	unsigned long new_mapped_ram_size;
+	bool is_low = false;
 
-	probe_page_size_mask();
-
-#ifdef CONFIG_X86_64
-	end = max_pfn << PAGE_SHIFT;
-#else
-	end = max_low_pfn << PAGE_SHIFT;
-#endif
+	if (!begin) {
+		probe_page_size_mask();
+		/* the ISA range is always mapped regardless of memory holes */
+		init_memory_mapping(0, ISA_END_ADDRESS);
+		begin = ISA_END_ADDRESS;
+		is_low = true;
+	}
 
-	/* the ISA range is always mapped regardless of memory holes */
-	init_memory_mapping(0, ISA_END_ADDRESS);
+	if (begin >= end)
+		return;
 
 	/* xen has big range in reserved near end of ram, skip it at first */
-	addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE,
-			PAGE_SIZE);
+	addr = memblock_find_in_range(begin, end, PMD_SIZE, PAGE_SIZE);
 	real_end = addr + PMD_SIZE;
 
 	/* step_size need to be small so pgt_buf from BRK could cover it */
 	step_size = PMD_SIZE;
-	max_pfn_mapped = 0; /* will get exact value next */
-	min_pfn_mapped = real_end >> PAGE_SHIFT;
+	local_max_pfn_mapped = begin >> PAGE_SHIFT;
+	local_min_pfn_mapped = real_end >> PAGE_SHIFT;
 	last_start = start = real_end;
-	while (last_start > ISA_END_ADDRESS) {
+	while (last_start > begin) {
 		if (last_start > step_size) {
 			start = round_down(last_start - 1, step_size);
-			if (start < ISA_END_ADDRESS)
-				start = ISA_END_ADDRESS;
+			if (start < begin)
+				start = begin;
 		} else
-			start = ISA_END_ADDRESS;
+			start = begin;
 		new_mapped_ram_size = init_range_memory_mapping(start,
 							last_start);
+		if ((last_start >> PAGE_SHIFT) > local_max_pfn_mapped)
+			local_max_pfn_mapped = last_start >> PAGE_SHIFT;
+		local_min_pfn_mapped = start >> PAGE_SHIFT;
 		last_start = start;
-		min_pfn_mapped = last_start >> PAGE_SHIFT;
 		/* only increase step_size after big range get mapped */
 		if (new_mapped_ram_size > mapped_ram_size)
			step_size <<= STEP_SIZE_SHIFT;
 		mapped_ram_size += new_mapped_ram_size;
 	}
 
-	if (real_end < end)
+	if (real_end < end) {
 		init_range_memory_mapping(real_end, end);
+		if ((end >> PAGE_SHIFT) > local_max_pfn_mapped)
+			local_max_pfn_mapped = end >> PAGE_SHIFT;
+	}
 
-#ifdef CONFIG_X86_64
-	if (max_pfn > max_low_pfn) {
-		/* can we preseve max_low_pfn ?*/
-		max_low_pfn = max_pfn;
+	if (is_low) {
+		low_min_pfn_mapped = local_min_pfn_mapped;
+		low_max_pfn_mapped = local_max_pfn_mapped;
 	}
-#else
-	early_ioremap_page_table_range_init();
-#endif
+#ifdef CONFIG_X86_32
+	early_ioremap_page_table_range_init();
 
 	load_cr3(swapper_pg_dir);
 	__flush_tlb_all();
+#endif
 
-	early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
+	early_memtest(begin, end);
 }
 
 /*
Index: linux-2.6/arch/x86/mm/numa.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/numa.c
+++ linux-2.6/arch/x86/mm/numa.c
@@ -211,13 +211,24 @@ static void __init setup_node_data(int n
 	/*
 	 * Allocate node data.  Try node-local memory and then any node.
 	 * Never allocate in DMA zone.
+	 * Can not use memblock_alloc_nid() as memblock.current_limit is not
+	 * set properly.
 	 */
-	nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
+	nd_pa = memblock_find_in_range_node(start, end, nd_size,
+					    SMP_CACHE_BYTES, nid);
+	if (!nd_pa)
+		nd_pa = memblock_find_in_range(start, end, nd_size,
+					       SMP_CACHE_BYTES);
+	if (!nd_pa)
+		nd_pa = memblock_find_in_range(0, MEMBLOCK_ALLOC_ACCESSIBLE,
+					       nd_size, SMP_CACHE_BYTES);
 	if (!nd_pa) {
 		pr_err("Cannot find %zu bytes in any node\n", nd_size);
 		return;
 	}
+	memblock_reserve(nd_pa, nd_size);
 	nd = __va(nd_pa);
+	memset(nd, 0, nd_size);
 
 	/* report and initialize */
 	printk(KERN_INFO "  NODE_DATA [mem %#010Lx-%#010Lx]\n",
@@ -520,8 +531,13 @@ static int __init numa_register_memblks(
 			end = max(mi->blk[i].end, end);
 		}
 
-		if (start < end)
+		if (start < end) {
+#ifdef CONFIG_X86_64
+			init_mem_mapping(max(start, PFN_PHYS(max_low_pfn)),
+					 end);
+#endif
 			setup_node_data(nid, start, end);
+		}
 	}
 
 	/* Dump memblock with node info and return. */
@@ -559,12 +575,10 @@ static int __init numa_init(int (*init_f
 	for (i = 0; i < MAX_LOCAL_APIC; i++)
 		set_apicid_to_node(i, NUMA_NO_NODE);
 
-	/*
-	 * Do not clear numa_nodes_parsed or zero numa_meminfo here, because
-	 * SRAT was parsed earlier in early_parse_srat().
-	 */
+	nodes_clear(numa_nodes_parsed);
 	nodes_clear(node_possible_map);
 	nodes_clear(node_online_map);
+	memset(&numa_meminfo, 0, sizeof(numa_meminfo));
 	WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
 	numa_reset_distance();
Index: linux-2.6/drivers/acpi/numa.c
===================================================================
--- linux-2.6.orig/drivers/acpi/numa.c
+++ linux-2.6/drivers/acpi/numa.c
@@ -282,15 +282,10 @@ acpi_table_parse_srat(enum acpi_srat_typ
 					    handler, max_entries);
 }
 
-static int srat_mem_cnt;
-void __init early_parse_srat(void)
+int __init acpi_numa_init(void)
 {
-	/*
-	 * Should not limit number with cpu num that is from NR_CPUS or nr_cpus=
-	 * SRAT cpu entries could have different order with that in MADT.
-	 * So go over all cpu entries in SRAT to get apicid to node mapping.
-	 */
+	int cnt = 0;
 
 	/* SRAT: Static Resource Affinity Table */
 	if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
@@ -298,21 +293,18 @@ void __init early_parse_srat(void)
 				      acpi_parse_x2apic_affinity, 0);
 		acpi_table_parse_srat(ACPI_SRAT_TYPE_CPU_AFFINITY,
 				      acpi_parse_processor_affinity, 0);
-		srat_mem_cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
-					    acpi_parse_memory_affinity,
-					    NR_NODE_MEMBLKS);
+		cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
+					    acpi_parse_memory_affinity,
+					    NR_NODE_MEMBLKS);
 	}
-}
 
-int __init acpi_numa_init(void)
-{
 	/* SLIT: System Locality Information Table */
 	acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
 
 	acpi_numa_arch_fixup();
 
-	if (srat_mem_cnt < 0)
-		return srat_mem_cnt;
+	if (cnt < 0)
+		return cnt;
 	else if (!parsed_numa_memblks)
 		return -ENOENT;
 	return 0;
Index: linux-2.6/include/linux/acpi.h
===================================================================
--- linux-2.6.orig/include/linux/acpi.h
+++ linux-2.6/include/linux/acpi.h
@@ -485,14 +485,6 @@ static inline bool acpi_driver_match_dev
 
 #endif	/* !CONFIG_ACPI */
 
-#ifdef CONFIG_ACPI_NUMA
-void __init early_parse_srat(void);
-#else
-static inline void early_parse_srat(void)
-{
-}
-#endif
-
 #ifdef CONFIG_ACPI
 void acpi_os_set_prepare_sleep(int (*func)(u8 sleep_state,
 					   u32 pm1a_ctrl, u32 pm1b_ctrl));
Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -643,6 +643,7 @@ kernel_physical_mapping_init(unsigned lo
 void __init initmem_init(void)
 {
 	memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
+	init_mem_mapping(max_low_pfn<<PAGE_SHIFT, max_pfn<<PAGE_SHIFT);
 }
Index: linux-2.6/mm/memblock.c
===================================================================
--- linux-2.6.orig/mm/memblock.c
+++ linux-2.6/mm/memblock.c
-	for (; curr >= 0; curr--) {
-		if ((movablemem_map.map[curr].start_pfn << PAGE_SHIFT)
-		    < this_end)
-			break;
-	}
+	if (movablemem_map.enable)
+		for (; curr >= 0; curr--) {
+			if ((movablemem_map.map[curr].start_pfn <<
+			     PAGE_SHIFT) < this_end)
+				break;
+		}
 
 	cand = round_down(this_end - size, align);
-	if (curr >= 0 &&
+	if (movablemem_map.enable && curr >= 0 &&
 	    cand < movablemem_map.map[curr].end_pfn << PAGE_SHIFT) {
 		this_end = movablemem_map.map[curr].start_pfn << PAGE_SHIFT;
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -204,7 +204,8 @@ static unsigned long __meminitdata dma_r
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 /* Movable memory ranges, will also be used by memblock subsystem. */
 struct movablemem_map movablemem_map = {
-	.acpi = false,
+	.enable = false,
+	.acpi = false,
 	.nr_map = 0,
 };
Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -1097,6 +1097,31 @@ void __init memblock_x86_fill(void)
 	memblock_dump_all();
 }
 
+unsigned long __init memblock_find_max_low_pfn(unsigned long nr_free_pages)
+{
+	unsigned long start_pfn, end_pfn, last_good_end_pfn = 0;
+	phys_addr_t start, end;
+	unsigned long nr = 0;
+	u64 u;
+
+	for_each_free_mem_range(u, MAX_NUMNODES, &start, &end, NULL) {
+		start_pfn = PFN_UP(start);
+		end_pfn = PFN_DOWN(end);
+		if (start_pfn >= end_pfn)
+			continue;
+
+		if ((end_pfn - start_pfn) < (nr_free_pages - nr)) {
+			nr += end_pfn - start_pfn;
+			last_good_end_pfn = end_pfn;
+			continue;
+		}
+
+		return start_pfn + (nr_free_pages - nr);
+	}
+
+	return last_good_end_pfn;
+}
+
 void __init memblock_find_dma_reserve(void)
 {
 #ifdef CONFIG_X86_64
Index: linux-2.6/arch/x86/include/asm/e820.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/e820.h
+++ linux-2.6/arch/x86/include/asm/e820.h
@@ -53,6 +53,7 @@ extern unsigned long e820_end_of_low_ram
 extern u64 early_reserve_e820(u64 sizet, u64 align);
 
 void memblock_x86_fill(void);
+unsigned long memblock_find_max_low_pfn(unsigned long nr_free_pages);
 void memblock_find_dma_reserve(void);
 
 extern void finish_e820_parsing(void);
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -763,6 +763,24 @@ static void __init e820_add_kernel_range
 	e820_add_region(start, size, E820_RAM);
 }
 
+static void __init adjust_max_low_pfn(void)
+{
+#ifdef CONFIG_X86_64
+	if (max_pfn > (1UL<<(32 - PAGE_SHIFT))) {
+		unsigned long pfn;
+
+		/* make sure max_low_pfn at least 4G free range */
+		pfn = memblock_find_max_low_pfn(1UL<<(32-PAGE_SHIFT));
+		if (pfn > max_low_pfn) {
+			/* round up to 1G boundary */
+			max_low_pfn = round_up(pfn, (1UL<<(30-PAGE_SHIFT)));
+			if (max_low_pfn > max_pfn)
+				max_low_pfn = max_pfn;
+		}
+	}
+#endif
+}
+
 static unsigned reserve_low = CONFIG_X86_RESERVE_LOW << 10;
 
 static int __init parse_reservelow(char *p)
@@ -1054,15 +1072,6 @@ void __init setup_arch(char **cmdline_p)
 	setup_bios_corruption_check();
 #endif
 
-	/*
-	 * In the memory hotplug case, the kernel needs info from SRAT to
-	 * determine which memory is hotpluggable before allocating memory
-	 * using memblock.
-	 */
-	acpi_boot_table_init();
-	early_acpi_boot_init();
-	early_parse_srat();
-
 #ifdef CONFIG_X86_32
 	printk(KERN_DEBUG "initial memory mapped: [mem 0x00000000-%#010lx]\n",
 			(max_pfn_mapped<<PAGE_SHIFT) - 1);
 #endif
+#ifdef CONFIG_X86_64
+	if (max_pfn > max_low_pfn) {
+		/* can we preseve max_low_pfn ?*/
+		max_low_pfn = max_pfn;
+	}
+	load_cr3(swapper_pg_dir);
+	__flush_tlb_all();
+#endif
+	early_trap_pf_init();
+
+	setup_real_mode();
+
+	dma_contiguous_reserve(0);
+
+	reserve_crashkernel();
+
+	memblock_find_dma_reserve();
@@ -1117,6 +1138,11 @@ void __init setup_arch(char **cmdline_p)
 
 	x86_init.paging.pagetable_init();
 
+	if (movablemem_map.nr_map) {
+		printk(KERN_DEBUG "movablemem_map is enabled!\n");
+		movablemem_map.enable = true;
+	}
+
 	if (boot_cpu_data.cpuid_level >= 0) {
 		/* A CPU has %cr4 if and only if it has CPUID */
 		mmu_cr4_features = read_cr4();