Re: [PATCH v1] memory-hotplug.rst: complete admin-guide overhaul

From: David Hildenbrand
Date: Tue Jun 08 2021 - 09:04:37 EST


+ZONE_MOVABLE
+============
+
+ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
+Further, having system RAM managed by ZONE_MOVABLE instead of one of the
+kernel zones can increase the number of possible transparent huge pages and
+dynamically allocated huge pages.
+

I'd move the first two paragraphs from "Zone Imbalances" here to provide
some context on what a movable and what an unmovable allocation is.

Makes sense.

[...]

-How to offline memory
----------------------
+Considerations

ZONE_MOVABLE Sizing Considerations ?


Ack

I'd also move the contents of "Boot Memory and ZONE_MOVABLE" here (with
some adjustments):

By default, all the memory configured at boot time is managed by the kernel
zones and ZONE_MOVABLE is not used.

To enable ZONE_MOVABLE to include the memory present at boot and to
control the ratio between movable and kernel zones there are two command
line options: ``kernelcore=`` and ``movablecore=``. See
Documentation/admin-guide/kernel-parameters.rst for their description.


Makes sense. I'll move it to the end of the "ZONE_MOVABLE Sizing Considerations" section.
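
For illustration, a bootloader command-line fragment using those options could
look like this (the sizes are example values, not recommendations; see
kernel-parameters.rst for the accepted syntax):

```
# Keep 4G in the kernel zones, manage the rest of boot memory via ZONE_MOVABLE:
kernelcore=4G

# Alternatively, manage exactly 8G of boot memory via ZONE_MOVABLE:
movablecore=8G
```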

+--------------
-You can offline a memory block by using the same sysfs interface that was used
-in memory onlining::
+We usually expect that a large portion of available system RAM will actually
+be consumed by user space, either directly or indirectly via the page cache. In
+the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.
- % echo offline > /sys/devices/system/memory/memoryXXX/state
+With that in mind, it makes sense that we can have a big portion of system RAM
+managed by ZONE_MOVABLE. However, there are some things to consider when
+using ZONE_MOVABLE, especially when fine-tuning zone ratios:
-If offline succeeds, the state of the memory block is changed to be "offline".
-If it fails, some error core (like -EBUSY) will be returned by the kernel.
-Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline
-it. If it doesn't contain 'unmovable' memory, you'll get success.
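
As a side note for readers following along: a defensive sketch of that offline
sequence (memory32 is a made-up block number here; pick a real one from
/sys/devices/system/memory/):

```shell
# Hypothetical memory block; the sysfs path layout is the documented interface.
block=/sys/devices/system/memory/memory32
if [ -w "$block/state" ]; then
    # May fail (e.g. -EBUSY) if the block contains unmovable allocations.
    echo offline > "$block/state"
    cat "$block/state"
else
    echo "memory block not present (or no permission)"
fi
```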
+- Having a lot of offline memory blocks. Even offline memory blocks consume
+ memory for metadata and page tables in the direct map; having a lot of
+ offline memory blocks is not a typical case, though.
+
+- Memory ballooning. Some memory ballooning implementations, such as
+ the Hyper-V balloon, the XEN balloon, the vbox balloon and the VMWare

So, everyone except virtio-mem? ;-)

Well, virtio-mem does not classify as a memory balloon in that sense, as it only operates on its own device memory ;)

virtio-balloon and pseries CMM support balloon compaction.

I'd drop the names, because if some of those ever implement balloon
compaction they will surely forget to update the docs.

I can do the opposite and mention the ones that already do. Some most probably will never support it.

"Memory ballooning without balloon compaction is incompatible with ZONE_MOVABLE. Only some implementations, such as virtio-balloon and pseries CMM, fully support balloon compaction."



+ balloon with huge pages don't support balloon compaction and, thereby,
+ ZONE_MOVABLE.
+
+ Further, CONFIG_BALLOON_COMPACTION might be disabled. In that case, balloon
+ inflation will only perform unmovable allocations and silently create a
+ zone imbalance, usually triggered by inflation requests from the
+ hypervisor.
+
+- Gigantic pages are unmovable, resulting in user space consuming a
+ lot of unmovable memory.
+
+- Huge pages are unmovable when an architecture does not support huge
+ page migration, resulting in a similar issue as with gigantic pages.
+
+- Page tables are unmovable. Excessive swapping, mapping extremely large
+ files or ZONE_DEVICE memory can be problematic, although only
+ really relevant in corner cases. When we manage a lot of user space memory
+ that has been swapped out or is served from a file/pmem/... we still need

^ persistent memory

Agreed.


+ a lot of page tables to manage that memory once user space has accessed
+ it.
+
+- DAX: when we have a lot of ZONE_DEVICE memory added to the system as DAX
+ and we are not using an altmap to allocate the memmap from device memory
+ directly, we will have to allocate the memmap for this memory from the
+ kernel zones.
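
When fine-tuning zone ratios as discussed above, it helps to inspect the
current per-zone sizes first; a quick sketch reading the documented
/proc/zoneinfo format:

```shell
# Print managed pages per zone; zones named "Movable" belong to ZONE_MOVABLE.
if [ -r /proc/zoneinfo ]; then
    awk '/^Node/ { zone = $0 }
         /^ +managed/ { print zone ": " $2 " pages" }' /proc/zoneinfo
else
    echo "/proc/zoneinfo not available"
fi
```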

I'm not sure an admin-guide reader will know when we use an altmap and when we don't.
Maybe

DAX: in certain DAX configurations the memory map for the device memory will
be allocated from the kernel zones.

Indeed, simpler and communicates the same message.

I'll also add

"KASAN can have a significant memory overhead, for example, consuming 1/8th of the total system memory size as (unmovable) tracking metadata."
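
To make the 1/8th figure concrete (generic KASAN's 8:1 shadow mapping; the
machine size is just an example):

```shell
# On a 32 GiB machine, generic KASAN shadow metadata consumes about:
echo "$((32 * 1024 / 8)) MiB"   # 4096 MiB, i.e. 4 GiB of unmovable memory
```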


Thanks Mike!

--
Thanks,

David / dhildenb