Re: [PATCH v2 1/5] mm: memory_hotplug: Memory hotplug (add) support for arm64

From: Arun KS
Date: Fri Nov 24 2017 - 00:55:18 EST


On Thu, Nov 23, 2017 at 4:43 PM, Maciej Bielski
<m.bielski@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> Introduces memory hotplug functionality (hot-add) for arm64.
>
> Changes v1->v2:
> - swapper pgtable updated in place on hot add, avoiding unnecessary copy:
> all changes are additive and non destructive.
>
> - stop_machine used to updated swapper on hot add, avoiding races
>
> - checking if pagealloc is under debug to stay coherent with mem_map
>
> Signed-off-by: Maciej Bielski <m.bielski@xxxxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Andrea Reale <ar@xxxxxxxxxxxxxxxxxx>
> ---
> arch/arm64/Kconfig | 12 ++++++
> arch/arm64/configs/defconfig | 1 +
> arch/arm64/include/asm/mmu.h | 3 ++
> arch/arm64/mm/init.c | 87 ++++++++++++++++++++++++++++++++++++++++++++
> arch/arm64/mm/mmu.c | 39 ++++++++++++++++++++
> 5 files changed, 142 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 0df64a6..c736bba 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -641,6 +641,14 @@ config HOTPLUG_CPU
> Say Y here to experiment with turning CPUs off and on. CPUs
> can be controlled through /sys/devices/system/cpu.
>
> +config ARCH_HAS_ADD_PAGES
> + def_bool y
> + depends on ARCH_ENABLE_MEMORY_HOTPLUG
> +
> +config ARCH_ENABLE_MEMORY_HOTPLUG
> + def_bool y
> + depends on !NUMA
> +
> # Common NUMA Features
> config NUMA
> bool "Numa Memory Allocation and Scheduler Support"
> @@ -715,6 +723,10 @@ config ARCH_HAS_CACHE_LINE_SIZE
>
> source "mm/Kconfig"
>
> +config ARCH_MEMORY_PROBE
> + def_bool y
> + depends on MEMORY_HOTPLUG
> +
> config SECCOMP
> bool "Enable seccomp to safely compute untrusted bytecode"
> ---help---
> diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
> index 34480e9..5fc5656 100644
> --- a/arch/arm64/configs/defconfig
> +++ b/arch/arm64/configs/defconfig
> @@ -80,6 +80,7 @@ CONFIG_ARM64_VA_BITS_48=y
> CONFIG_SCHED_MC=y
> CONFIG_NUMA=y
> CONFIG_PREEMPT=y
> +CONFIG_MEMORY_HOTPLUG=y
> CONFIG_KSM=y
> CONFIG_TRANSPARENT_HUGEPAGE=y
> CONFIG_CMA=y
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index 0d34bf0..2b3fa4d 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -40,5 +40,8 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
> pgprot_t prot, bool page_mappings_only);
> extern void *fixmap_remap_fdt(phys_addr_t dt_phys);
> extern void mark_linear_text_alias_ro(void);
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +extern void hotplug_paging(phys_addr_t start, phys_addr_t size);
> +#endif
>
> #endif
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 5960bef..e96e7d3 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -722,3 +722,90 @@ static int __init register_mem_limit_dumper(void)
> return 0;
> }
> __initcall(register_mem_limit_dumper);
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +int add_pages(int nid, unsigned long start_pfn,
> + unsigned long nr_pages, bool want_memblock)
> +{
> + int ret;
> + u64 start_addr = start_pfn << PAGE_SHIFT;
> + /*
> + * Mark the first page in the range as unusable. This is needed
> + * because __add_section (within __add_pages) wants pfn_valid
> + * of it to be false, and in arm64 pfn falid is implemented by
> + * just checking at the nomap flag for existing blocks.
> + *
> + * A small trick here is that __add_section() requires only
> + * phys_start_pfn (that is the first pfn of a section) to be
> + * invalid. Regardless of whether it was assumed (by the function
> + * author) that all pfns within a section are either all valid
> + * or all invalid, it allows to avoid looping twice (once here,
> + * second when memblock_clear_nomap() is called) through all
> + * pfns of the section and modify only one pfn. Thanks to that,
> + * further, in __add_zone() only this very first pfn is skipped
> + * and corresponding page is not flagged reserved. Therefore it
> + * is enough to correct this setup only for it.
> + *
> + * When arch_add_memory() returns the walk_memory_range() function
> + * is called and passed with online_memory_block() callback,
> + * which execution finally reaches the memory_block_action()
> + * function, where also only the first pfn of a memory block is
> + * checked to be reserved. Above, it was first pfn of a section,
> + * here it is a block but
> + * (drivers/base/memory.c):
> + * sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
> + * (include/linux/memory.h):
> + * #define MIN_MEMORY_BLOCK_SIZE (1UL << SECTION_SIZE_BITS)
> + * so we can consider block and section equivalently
> + */
> + memblock_mark_nomap(start_addr, 1<<PAGE_SHIFT);
> + ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
> +
> + /*
> + * Make the pages usable after they have been added.
> + * This will make pfn_valid return true
> + */
> + memblock_clear_nomap(start_addr, 1<<PAGE_SHIFT);
> +
> + /*
> + * This is a hack to avoid having to mix arch specific code
> + * into arch independent code. SetPageReserved is supposed
> + * to be called by __add_zone (within __add_section, within
> + * __add_pages). However, when it is called there, it assumes that
> + * pfn_valid returns true. For the way pfn_valid is implemented
> + * in arm64 (a check on the nomap flag), the only way to make
> + * this evaluate true inside __add_zone is to clear the nomap
> + * flags of blocks in architecture independent code.
> + *
> + * To avoid this, we set the Reserved flag here after we cleared
> + * the nomap flag in the line above.
> + */
> + SetPageReserved(pfn_to_page(start_pfn));
> +
> + return ret;
> +}
> +
> +int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
> +{
> + int ret;
> + unsigned long start_pfn = start >> PAGE_SHIFT;
> + unsigned long nr_pages = size >> PAGE_SHIFT;
> + unsigned long end_pfn = start_pfn + nr_pages;
> + unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS-PAGE_SHIFT);
> +
> + if (end_pfn > max_sparsemem_pfn) {
> + pr_err("end_pfn too big");
> + return -1;
> + }
> + hotplug_paging(start, size);
> +
> + ret = add_pages(nid, start_pfn, nr_pages, want_memblock);
> +
> + if (ret)
> + pr_warn("%s: Problem encountered in __add_pages() ret=%d\n",
> + __func__, ret);
> +
> + return ret;
> +}
> +
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index f1eb15e..d93043d 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -28,6 +28,7 @@
> #include <linux/mman.h>
> #include <linux/nodemask.h>
> #include <linux/memblock.h>
> +#include <linux/stop_machine.h>
> #include <linux/fs.h>
> #include <linux/io.h>
> #include <linux/mm.h>
> @@ -615,6 +616,44 @@ void __init paging_init(void)
> SWAPPER_DIR_SIZE - PAGE_SIZE);
> }
>
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +
> +/*
> + * hotplug_paging() is used by memory hotplug to build new page tables
> + * for hot added memory.
> + */
> +
> +struct mem_range {
> + phys_addr_t base;
> + phys_addr_t size;
> +};
> +
> +static int __hotplug_paging(void *data)
> +{
> + int flags = 0;
> + struct mem_range *section = data;
> +
> + if (debug_pagealloc_enabled())
> + flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> +
> + __create_pgd_mapping(swapper_pg_dir, section->base,
> + __phys_to_virt(section->base), section->size,
> + PAGE_KERNEL, pgd_pgtable_alloc, flags);

Hello Andrea,

__hotplug_paging runs on stop_machine context.
cpu stop callbacks must not sleep.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/stop_machine.c?h=v4.14#n479

__create_pgd_mapping uses pgd_pgtable_alloc. which does
__get_free_page(PGALLOC_GFP)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/mm/mmu.c?h=v4.14#n342

PGALLOC_GFP has GFP_KERNEL which inturn has __GFP_RECLAIM

#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)

Now, prepare_alloc_pages() called by __alloc_pages_nodemask checks for

might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/page_alloc.c?h=v4.14#n4150

and then BUG()

I was testing on 4.4 kernel, but cross checked with 4.14 as well.

Regards,
Arun


> +
> + return 0;
> +}
> +
> +inline void hotplug_paging(phys_addr_t start, phys_addr_t size)
> +{
> + struct mem_range section = {
> + .base = start,
> + .size = size,
> + };
> +
> + stop_machine(__hotplug_paging, &section, NULL);
> +}
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> +
> /*
> * Check whether a kernel address is valid (derived from arch/x86/).
> */
> --
> 2.7.4
>