[PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory groups

From: David Hildenbrand
Date: Mon Jun 07 2021 - 15:54:51 EST


Hi,

this series aims at improving in-kernel auto-online support. It tackles the
fundamental problems that:

1) We can create zone imbalances when onlining all memory blindly to
ZONE_MOVABLE, in the worst case crashing the system. We have to know
upfront how much memory we are going to hotplug such that we can
safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
via "online_movable". This is far from practical and only applicable in
limited setups -- like inside VMs under the RHV/oVirt hypervisor which
will never hotplug more than 3 times the boot memory (and the
limitation is only in place due to the Linux limitation).

2) We see more setups that implement dynamic VM resizing, hot(un)plugging
memory to resize VM memory. In these setups, we might hotplug a lot of
memory, but it might happen in various small steps in both directions
(e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
primary driver of this upstream right now, performing such dynamic
resizing NUMA-aware via multiple virtio-mem devices.

Onlining all hotplugged memory to ZONE_NORMAL means we basically have
no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
easily run into zone imbalances when growing a VM. We want a mixture,
and we want as much memory as reasonable/configured in ZONE_MOVABLE.

3) Memory devices consist of 1..X memory block devices, however, the
kernel doesn't really track the relationship. Consequently, also user
space has no idea. We want to make per-device decisions. As one
example, for memory hotunplug it doesn't make sense to use a mixture of
zones within a single DIMM: we want all MOVABLE if possible, otherwise
all !MOVABLE, because any !MOVABLE part will easily block the DIMM from
getting hotunplugged. As another example, virtio-mem operates on
individual units that span 1..X memory blocks. Similar to a DIMM, we
want a unit to either be all MOVABLE or !MOVABLE. Further, we want
as much memory of a virtio-mem device to be MOVABLE as possible.

4) We want memory onlining to be done right from the kernel while adding
memory; for example, this is reqired for fast memory hotplug for
drivers that add individual memory blocks, like virito-mem. We want a
way to configure a policy in the kernel and avoid implementing advanced
policies in user space.

The auto-onlining support we have in the kernel is not sufficient. All we
have is a) online everything movable (online_movable) b) online everything
!movable (online_kernel) c) keep zones contiguous (online). This series
allows configuring c) to mean instead "online movable if possible according
to the coniguration, driven by a maximum MOVABLE:KERNEL ratio" -- a new
onlining policy.

This series does 3 things:

1) Introduces the "auto-movable" online policy that initially operates on
individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
to make a decision whether a memory block will be onlined to
ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
memory does not allow for more MOVABLE memory (details in the
patches). CMA memory is treated like MOVABLE memory.

2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
groups and uses group information to make decisions in the
"auto-movable" online policy accross memory blocks of a single memory
device (modeled as memory group).

3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
allowing ZONE_NORMAL memory within a dynamic memory group to allow for
more ZONE_MOVABLE memory within the same memory group. The target use
case is dynamic VM resizing using virtio-mem.

I remember that the basic idea of using a ratio to implement a policy in
the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong
(I lost the pointer to that discussion).

For me, the main use case is using it along with virtio-mem (and
DIMMs / ppc64 dlpar where necessary) for dynamic resizing of VMs,
increasing the amount of memory we can hotunplug reliably again if we
might eventually hotplug a lot of memory to a VM.


The target usage will be:

1) Linux boots with "mhp_default_online_type=offline"

2) User space (e.g., systemd unit) configures memory onlining (according
to a config file and system properties), for example:
* Setting memory_hotplug.online_policy=auto-movable
* Setting memory_hotplug.auto_movable_ratio=301
* Setting memory_hotplug.auto_movable_numa_aware=true

3) User space enabled auto onlining via "echo online >
/sys/devices/system/memory/auto_online_blocks"

4) User space triggers manual onlining of all already-offline memory
blocks (go over offline memory blocks and set them to "online")


For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
301% results in the following layout:
Memory block 1-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-79: Movable (DIMM 0)
Memory block 80-111: Movable (DIMM 1)
Memory block 112-143: Movable (DIMM 2)
Memory block 144-275: Normal (DIMM 3)
Memory block 176-207: Normal (DIMM 4)
... all Normal
(-> hotplugged Normal memory does not allow for more Movable memory)

For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
will result in the following layout:
Memory block 1-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-143: Movable (virtio-mem, first 12 GiB)
Memory block 144: Normal (virtio-mem, next 128 MiB)
Memory block 145-147: Movable (virtio-mem, next 384 MiB)
Memory block 148: Normal (virtio-mem, next 128 MiB)
Memory block 149-151: Movable (virtio-mem, next 384 MiB)
... Normal/Movable mixture as above
(-> hotplugged Normal memory allows for more Movable memory within
the same device)

Which gives us maximum flexibility when dynamically growing/shrinking a
VM in smaller steps. When shrinking, virtio-mem will prioritize unplug of
MOVABLE memory with [1] sent last week, such that we won't accidentially
trigger zone imbalances in more complicated setups that involve multiple
virtio-mem devices.


I'll update the memory-hotplug.rst documentation separately, once the
overhaul is done. For now I decided to use module parameters that can be
changed at runtime and not add new sysfs files for configuration. Easier
and cleaner IMHO -- especially, temporarily overwritable via the cmdline.

Future work:
- Use a single static memory group for dax/kmem
- Sense upfront for DIMMs (and dax/kmem) which pieces can actually be added
and won't actually contribute to the static memory group size
- Use memory groups for ppc64 dlpar
- More tunables. For example, a way to configure to keep some memory
default offline (s390x standby memory, dax/kmem)
- Indicate to user space that MOVABLE might be a bad idea -- especially
relevant when memory ballooning without support for balloon compaction
is active.

Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
Cc: "Michael S. Tsirkin" <mst@xxxxxxxxxx>
Cc: Jason Wang <jasowang@xxxxxxxxxx>
Cc: Marek Kedzierski <mkedzier@xxxxxxxxxx>
Cc: Hui Zhu <teawater@xxxxxxxxx>
Cc: Pankaj Gupta <pankaj.gupta.linux@xxxxxxxxx>
Cc: Wei Yang <richard.weiyang@xxxxxxxxxxxxxxxxx>
Cc: Oscar Salvador <osalvador@xxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxxxx>
Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
Cc: Anshuman Khandual <anshuman.khandual@xxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Mike Rapoport <rppt@xxxxxxxxxx>
Cc: "Rafael J. Wysocki" <rjw@xxxxxxxxxxxxx>
Cc: Len Brown <lenb@xxxxxxxxxx>
Cc: Pavel Tatashin <pasha.tatashin@xxxxxxxxxx>
Cc: virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
Cc: linux-mm@xxxxxxxxx
Cc: linux-acpi@xxxxxxxxxxxxxxx

[1] https://lkml.kernel.org/r/20210602185720.31821-1-david@xxxxxxxxxx

David Hildenbrand (12):
mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()
mm: track present early pages per zone
mm/memory_hotplug: introduce "auto-movable" online policy
mm/memory_hotplug: remove nid parameter from arch_remove_memory()
mm/memory_hotplug: remove nid parameter from remove_memory() and
friends
drivers/base/memory: "memory groups" to logically group memory blocks
mm/memory_hotplug: track present pages in memory groups
ACPI: memhotplug: memory resources cannot be enabled yet
ACPI: memhotplug: use a single static memory group for a single memory
device
virtio-mem: use a single dynamic memory group for a single virtio-mem
device
mm/memory_hotplug: memory group aware "auto-movable" online policy
mm/memory_hotplug: improved dynamic memory group aware "auto-movable"
online policy

arch/arm64/mm/mmu.c | 3 +-
arch/ia64/mm/init.c | 3 +-
arch/powerpc/mm/mem.c | 3 +-
.../platforms/pseries/hotplug-memory.c | 9 +-
arch/s390/mm/init.c | 3 +-
arch/sh/mm/init.c | 3 +-
arch/x86/mm/init_32.c | 3 +-
arch/x86/mm/init_64.c | 3 +-
drivers/acpi/acpi_memhotplug.c | 46 ++-
drivers/base/memory.c | 163 +++++++--
drivers/dax/kmem.c | 3 +-
drivers/virtio/virtio_mem.c | 26 +-
include/linux/memory.h | 53 ++-
include/linux/memory_hotplug.h | 35 +-
include/linux/mmzone.h | 7 +
mm/memory_hotplug.c | 346 +++++++++++++++++-
mm/memremap.c | 5 +-
mm/page_alloc.c | 3 +
18 files changed, 618 insertions(+), 99 deletions(-)


base-commit: 614124bea77e452aa6df7a8714e8bc820b489922
--
2.31.1