Re: [PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory groups

From: David Hildenbrand
Date: Tue Jun 08 2021 - 06:12:19 EST

On 08.06.21 11:42, Oscar Salvador wrote:
On Mon, Jun 07, 2021 at 09:54:18PM +0200, David Hildenbrand wrote:

this series aims at improving in-kernel auto-online support. It tackles the
fundamental problems that:

Hi David,

the idea sounds good to me, and I like that this series takes away part of the
responsibility from the user to know where the memory should go.
I think the kernel is a much better fit for that as it has all the required
information to balance things.

I also glanced over the series and besides some things here and there the
whole approach looks sane.
I plan to have a look into it in a few days, just have some high level questions
for the time being:

Hi Oscar,

1) We can create zone imbalances when onlining all memory blindly to
ZONE_MOVABLE, in the worst case crashing the system. We have to know
upfront how much memory we are going to hotplug such that we can
safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
via "online_movable". This is far from practical and only applicable in
limited setups -- like inside VMs under the RHV/oVirt hypervisor which
will never hotplug more than 3 times the boot memory (and the
limitation is only in place due to the Linux limitation).

Could you give more insight about the problems created by zone imbalances (e.g.,
a lot of movable memory and little kernel memory)?

I just updated memory-hotplug.rst exactly for that purpose :)

There, also safe zone ratios and "usually well known values" are given. I can link it in the next cover letter.

2) We see more setups that implement dynamic VM resizing, hot(un)plugging
memory to resize VM memory. In these setups, we might hotplug a lot of
memory, but it might happen in various small steps in both directions
(e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
primary driver of this upstream right now, performing such dynamic
resizing NUMA-aware via multiple virtio-mem devices.

Onlining all hotplugged memory to ZONE_NORMAL means we basically have
no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
easily run into zone imbalances when growing a VM. We want a mixture,
and we want as much memory as reasonable/configured in ZONE_MOVABLE.

3) Memory devices consist of 1..X memory block devices, however, the
kernel doesn't really track the relationship. Consequently, also user
space has no idea. We want to make per-device decisions. As one
example, for memory hotunplug it doesn't make sense to use a mixture of
zones within a single DIMM: we want all MOVABLE if possible, otherwise
all !MOVABLE, because any !MOVABLE part will easily block the DIMM from
getting hotunplugged. As another example, virtio-mem operates on
individual units that span 1..X memory blocks. Similar to a DIMM, we
want a unit to either be all MOVABLE or !MOVABLE. Further, we want
as much memory of a virtio-mem device to be MOVABLE as possible.

So, a virtio-mem unit could be seen as a DIMM, right?

It's a bit more complicated. Each individual unit (e.g., a 128 MiB memory block) is the smallest granularity we can add/remove of that device. So such a unit is somewhat like a DIMM. However, all "units" of the device can interact -- it's a single memory device.

4) We want memory onlining to be done right from the kernel while adding
memory; for example, this is required for fast memory hotplug for
drivers that add individual memory blocks, like virtio-mem. We want a
way to configure a policy in the kernel and avoid implementing advanced
policies in user space.

"we want memory onlining to be done right from the kernel while adding memory"

is not that always the case when a driver adds memory? The user has no interaction
with that, right?

Well, with auto-onlining in the kernel disabled, user space has to do the onlining -- for example via udev rules right now in major distributions.

But there are also users that always want to online manually in user space to select a zone. Most prominently standby memory on s390x, but also in some cases dax/kmem memory. But these two are really corner cases. In general, we want hotplugged memory to be onlined immediately.

The auto-onlining support we have in the kernel is not sufficient. All we
have is a) online everything movable (online_movable) b) online everything
!movable (online_kernel) c) keep zones contiguous (online). This series
allows configuring c) to mean instead "online movable if possible according
to the configuration, driven by a maximum MOVABLE:KERNEL ratio" -- a new
onlining policy.

This series does 3 things:

1) Introduces the "auto-movable" online policy that initially operates on
individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
to make a decision whether a memory block will be onlined to
ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
memory does not allow for more MOVABLE memory (details in the
patches). CMA memory is treated like MOVABLE memory.

How would a user know which ratio is sane? Could we add some info in the
documentation that sets some "basic" rules?

Again, currently resides in the memory-hotplug.rst overhaul.

2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
groups and uses group information to make decisions in the
"auto-movable" online policy across memory blocks of a single memory
device (modeled as memory group).

So, the distinction being that a DIMM cannot grow larger but we can add more
memory to a virtio-mem unit? I feel I am missing some insight here.

Right, the relevant patch contains more info.

You either plug or unplug a DIMM (or a NUMA node which spans multiple DIMMs) -- both are ACPI memory devices that span multiple physical regions. You cannot unplug parts of a DIMM or grow it. That is "static", as also expressed by the ACPI code (which "adds" and "removes" all memory of a memory device in one go).

virtio-mem behaves differently, as it's a single physical memory region in which we dynamically add or remove memory. The granularity in which we add/remove memory from Linux is a "unit". In the simplest case, it's just a single memory block (e.g., 128 MiB). So it's a memory device that can grow/shrink in the given unit -- "dynamic".

3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
allowing ZONE_NORMAL memory within a dynamic memory group to allow for
more ZONE_MOVABLE memory within the same memory group. The target use
case is dynamic VM resizing using virtio-mem.

Sorry, I got lost in this one. Care to explain a bit more?

The virtio-mem example below should make this a bit clearer (in addition to the relevant patch), especially in contrast to static memory devices like DIMMs. The key is that a single virtio-mem device is a "dynamic memory group" in which memory can get added/removed dynamically in a given unit granularity. And we want to special-case that type of device so that as much memory of a virtio-mem device as possible (and as configured) is MOVABLE.
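
As a rough illustration of point 3), the following Python sketch models a single dynamic memory group. It is a simplified model and not the kernel implementation: the function name, the block-based accounting, and the exact comparison are assumptions made for illustration. The one idea taken from the description above is that ZONE_NORMAL memory onlined within the group counts as KERNEL memory for later decisions in that same group.

```python
# Simplified model, NOT the kernel implementation; all names are hypothetical.
# Sizes are counted in memory blocks (e.g., 128 MiB each).

def online_dynamic_group(early_kernel_blocks, ratio_percent, units):
    """Pick a zone for each unit added to one dynamic memory group.

    Assumption: a unit may go to ZONE_MOVABLE if, afterwards,
    MOVABLE * 100 <= ratio_percent * KERNEL, where KERNEL is the early
    (boot) kernel memory plus the ZONE_NORMAL memory already onlined
    within this same group.
    """
    movable = 0
    group_normal = 0
    zones = []
    for _ in range(units):
        kernel = early_kernel_blocks + group_normal
        if (movable + 1) * 100 <= ratio_percent * kernel:
            movable += 1
            zones.append("Movable")
        else:
            # A Normal unit raises KERNEL and so allows more Movable later.
            group_normal += 1
            zones.append("Normal")
    return zones


# 4 GiB of boot memory (32 blocks), ratio 301%, 104 single-block units.
zones = online_dynamic_group(early_kernel_blocks=32, ratio_percent=301, units=104)
print(zones[:96].count("Movable"))  # how many of the first 96 units are Movable
print(zones[96:104])                # zones chosen for the next 8 units
```

Under these assumptions the first 96 units (12 GiB) go to Movable, after which each Normal unit is followed by three Movable ones.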

The target usage will be:

1) Linux boots with "mhp_default_online_type=offline"

2) User space (e.g., systemd unit) configures memory onlining (according
to a config file and system properties), for example:
* Setting memory_hotplug.online_policy=auto-movable
* Setting memory_hotplug.auto_movable_ratio=301
* Setting memory_hotplug.auto_movable_numa_aware=true

I think we would need to document those in order to let the user know what
is best for them, e.g., when do we want to enable auto_movable_numa_aware etc.

Yes, as mentioned below, a memory-hotplug.rst update will follow once the overhaul is done. The respective patch contains more information.
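
For step 2) of the target usage, a user-space helper might look roughly like the sketch below. It assumes that the three memory_hotplug parameters named above are exposed as writable files under /sys/module/memory_hotplug/parameters/ (the usual location for module parameters). The helper name and its arguments are hypothetical, and writing to sysfs needs root.

```python
from pathlib import Path

# Assumed location of the memory_hotplug module parameters.
PARAM_DIR = Path("/sys/module/memory_hotplug/parameters")


def configure_auto_movable(ratio=301, numa_aware=True, param_dir=PARAM_DIR):
    """Write the three settings listed above; returns what was written."""
    settings = {
        "online_policy": "auto-movable",
        "auto_movable_ratio": str(ratio),
        # Boolean module parameters accept 1/0.
        "auto_movable_numa_aware": "1" if numa_aware else "0",
    }
    for name, value in settings.items():
        (Path(param_dir) / name).write_text(value)
    return settings
```

Called as configure_auto_movable() from a boot-time service, this would apply the example values from the list above. The param_dir argument exists only so that the helper can be pointed at a scratch directory for testing.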

For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
301% results in the following layout:
Memory block 1-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-79: Movable (DIMM 0)
Memory block 80-111: Movable (DIMM 1)
Memory block 112-143: Movable (DIMM 2)
Memory block 144-175: Normal (DIMM 3)
Memory block 176-207: Normal (DIMM 4)
... all Normal
(-> hotplugged Normal memory does not allow for more Movable memory)
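
The DIMM layout above can be reproduced with a small Python model. This is a sketch under stated assumptions, not the kernel code: the function name is hypothetical, and KERNEL memory is taken to be the 4 GiB of boot memory only, following the note that hotplugged Normal memory does not allow for more Movable memory in the basic form.

```python
# Simplified model, NOT the kernel implementation; names are hypothetical.

def online_static_dimms(boot_gib, dimm_gib, ratio_percent, dimms):
    """Pick one zone per DIMM; a DIMM is onlined all-or-nothing."""
    kernel = boot_gib  # assumption: only boot memory counts as KERNEL
    movable = 0
    zones = []
    for _ in range(dimms):
        if (movable + dimm_gib) * 100 <= ratio_percent * kernel:
            movable += dimm_gib
            zones.append("Movable")
        else:
            # Hotplugged Normal memory does not raise the MOVABLE limit here.
            zones.append("Normal")
    return zones


print(online_static_dimms(boot_gib=4, dimm_gib=4, ratio_percent=301, dimms=5))
# → ['Movable', 'Movable', 'Movable', 'Normal', 'Normal']
```

With a 301% ratio and 4 GiB of boot memory the limit is 12.04 GiB of Movable memory, so three 4 GiB DIMMs fit and the fourth does not.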

Uhm, I am sorry for being dense here:

On x86_64, 4 GiB = 32 sections (of 128 MiB each). Why do the memory blocks span from #1 to #47?

Sorry, it's actually "Memory block 0-15", which gives us 0-15 and 32-47 == 32 memory blocks corresponding to boot memory. Note that the absent memory blocks 16-31 should correspond to the PCI hole.
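
The block numbering can be checked with a few lines of Python. The 128 MiB block size, the hole at blocks 16-31, and the first hotplugged block (48) are taken from this thread. The helper name is hypothetical, and the assumption that DIMMs are placed back to back is made for illustration only.

```python
BLOCK_MIB = 128  # memory block size on x86_64, as used in this thread


def block_range(start_block, size_gib):
    """Memory block numbers covered by size_gib starting at start_block."""
    count = size_gib * 1024 // BLOCK_MIB
    return range(start_block, start_block + count)


boot_low = range(0, 16)    # DMA32 (early)
hole = range(16, 32)       # absent blocks, should correspond to the PCI hole
boot_high = range(32, 48)  # Normal (early)
print(len(boot_low) + len(boot_high))  # boot memory in blocks

# Assumption: 4 GiB DIMMs are placed back to back starting at block 48.
dimm_blocks = [block_range(48 + i * 32, 4) for i in range(5)]
for i, r in enumerate(dimm_blocks):
    print(f"DIMM {i}: {r.start}-{r.stop - 1}")
```

Each 4 GiB DIMM covers 32 blocks, so DIMM 0 spans 48-79 and DIMM 4 spans 176-207.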

Thanks Oscar!


David / dhildenb