Re: [PATCH v1] memory-hotplug.rst: complete admin-guide overhaul

From: Mike Rapoport
Date: Mon Jun 07 2021 - 07:35:58 EST


On Tue, May 25, 2021 at 12:26:04PM +0200, David Hildenbrand wrote:
> The memory hot(un)plug documentation is outdated and incomplete. Most of
> the content dates back to 2007, so it's time for a major overhaul.
>
> Let's rewrite, reorganize and update most parts of the documentation. In
> addition to memory hot(un)plug, also add some details regarding
> ZONE_MOVABLE, with memory hotunplug being one of its main consumers.
>
> The style of the document is also properly fixed that e.g., "restview"
> renders it cleanly now.
>
> In the future, we might add some more details about virt users like
> virtio-mem, the XEN balloon, the Hyper-V balloon and ppc64 dlpar.
>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Oscar Salvador <osalvador@xxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> Cc: Mike Rapoport <rppt@xxxxxxxxxx>
> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Cc: Anshuman Khandual <anshuman.khandual@xxxxxxx>
> Cc: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> Cc: Pavel Tatashin <pasha.tatashin@xxxxxxxxxx>
> Cc: Jonathan Corbet <corbet@xxxxxxx>
> Cc: Stephen Rothwell <sfr@xxxxxxxxxxxxxxxx>
> Cc: linux-doc@xxxxxxxxxxxxxxx
> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
> ---
>
> Based on linux-next, which includes hugetlb vmemmap changes to the doc
> that are not upstream yet.
>
> ---
> .../admin-guide/mm/memory-hotplug.rst | 738 +++++++++++-------
> 1 file changed, 440 insertions(+), 298 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
> index c6bae2d77160..c95f5c2b30dd 100644
> --- a/Documentation/admin-guide/mm/memory-hotplug.rst
> +++ b/Documentation/admin-guide/mm/memory-hotplug.rst

...

> +ZONE_MOVABLE
> +============
> +
> +ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
> +Further, having system RAM managed by ZONE_MOVABLE instead of one of the
> +kernel zones can increase the number of possible transparent huge pages and
> +dynamically allocated huge pages.
> +

I'd move the first two paragraphs from "Zone Imbalances" here to provide
some context on what is a movable and what is an unmovable allocation.

> +Only movable allocations are served from ZONE_MOVABLE, resulting in
> +unmovable allocations being limited to the kernel zones. Without ZONE_MOVABLE,
> +there is absolutely no guarantee whether a memory block can be offlined
> +successfully.
> +
> +Zone Imbalances
> +---------------
> +
> +Most kernel allocations are unmovable. Important examples include the memmap
> +(usually 1/64 of memory), page tables, and kmalloc(). Such allocations
> +can only be served from the kernel zones.
> +
> +Most user space pages, such as anonymous memory, and page cache pages
> +are movable. Such allocations can be served from ZONE_MOVABLE and the kernel
> +zones.
> +
> +Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
> +which can harm the system or degrade performance. As one example, the kernel
> +might crash because it runs out of free memory for unmovable allocations,
> +although there is still plenty of free memory left in ZONE_MOVABLE.
> +
> +Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
> +are definitely impossible due to the memmap overhead.
> +
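Maybe spell out why 63:1 cannot work, assuming the usual 64 bytes of memmap
per 4 KiB page (the 1/64 mentioned above): on a 64 GiB machine, a 63:1 ratio
leaves 1 GiB for the kernel zones, but the memmap for all 64 GiB already needs

    64 GiB / 4 KiB * 64 bytes = 1 GiB

leaving nothing for page tables, kmalloc(), etc.
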
> +Actual safe zone ratios depend on the workload. Extreme cases, like excessive
> +long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all.
>
> .. note::
> - Techniques that rely on long-term pinnings of memory (especially, RDMA and
> - vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
> - hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that
> - memory can still get hot removed - be aware that pinning can fail even if
> - there is plenty of free memory in ZONE_MOVABLE. In addition, using
> - ZONE_MOVABLE might make page pinning more expensive, because pages have to be
> - migrated off that zone first.
>
> -.. _memory_hotplug_how_to_offline_memory:
> + CMA memory part of a kernel zone essentially behaves like memory in
> + ZONE_MOVABLE and similar considerations apply, especially when combining
> + CMA with ZONE_MOVABLE.
>
> -How to offline memory
> ----------------------
> +Considerations

ZONE_MOVABLE Sizing Considerations ?

I'd also move the contents of "Boot Memory and ZONE_MOVABLE" here (with
some adjustments):

By default, all the memory configured at boot time is managed by the kernel
zones and ZONE_MOVABLE is not used.

To enable ZONE_MOVABLE to include the memory present at boot and to
control the ratio between movable and kernel zones, there are two command
line options: ``kernelcore=`` and ``movablecore=``. See
Documentation/admin-guide/kernel-parameters.rst for their description.
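
Perhaps followed by a short example; from memory (please double-check the
exact syntax against kernel-parameters.rst), something like

    kernelcore=4G

to keep 4 GiB total in the kernel zones, or

    movablecore=75%

to put roughly 75% of the boot memory into ZONE_MOVABLE.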

> +--------------
>
> -You can offline a memory block by using the same sysfs interface that was used
> -in memory onlining::
> +We usually expect that a large portion of available system RAM will actually
> +be consumed by user space, either directly or indirectly via the page cache. In
> +the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.
>
> - % echo offline > /sys/devices/system/memory/memoryXXX/state
> +With that in mind, it makes sense that we can have a big portion of system RAM
> +managed by ZONE_MOVABLE. However, there are some things to consider when
> +using ZONE_MOVABLE, especially when fine-tuning zone ratios:
>
> -If offline succeeds, the state of the memory block is changed to be "offline".
> -If it fails, some error core (like -EBUSY) will be returned by the kernel.
> -Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline
> -it. If it doesn't contain 'unmovable' memory, you'll get success.
> +- Having a lot of offline memory blocks. Even offline memory blocks consume
> + memory for metadata and page tables in the direct map; having a lot of
> + offline memory blocks is not a typical case, though.
> +
> +- Memory ballooning. Some memory ballooning implementations, such as
> + the Hyper-V balloon, the XEN balloon, the vbox balloon and the VMWare

So, everyone except virtio-mem? ;-)
I'd drop the names, because if some of those gain balloon compaction support
later, updating the docs will surely be forgotten.

> + balloon with huge pages don't support balloon compaction and, thereby
> + ZONE_MOVABLE.
> +
> + Further, CONFIG_BALLOON_COMPACTION might be disabled. In that case, balloon
> + inflation will only perform unmovable allocations and silently create a
> + zone imbalance, usually triggered by inflation requests from the
> + hypervisor.
> +
> +- Gigantic pages are unmovable, resulting in user space consuming a
> + lot of unmovable memory.
> +
> +- Huge pages are unmovable when an architecture does not support huge
> + page migration, resulting in a similar issue as with gigantic pages.
> +
> +- Page tables are unmovable. Excessive swapping, mapping extremely large
> + files or ZONE_DEVICE memory can be problematic, although only
> + really relevant in corner cases. When we manage a lot of user space memory
> + that has been swapped out or is served from a file/pmem/... we still need

^ persistent memory

> + a lot of page tables to manage that memory once user space accessed that
> + memory.
> +
> +- DAX: when we have a lot of ZONE_DEVICE memory added to the system as DAX
> + and we are not using an altmap to allocate the memmap from device memory
> + directly, we will have to allocate the memmap for this memory from the
> + kernel zones.

I'm not sure an admin-guide reader will know when we use an altmap and when
we don't. Maybe:

DAX: in certain DAX configurations the memory map for the device memory will
be allocated from the kernel zones.
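
Perhaps with a number to make the impact tangible, e.g.: for 1 TiB of DAX
device memory, the memory map alone needs roughly 16 GiB (64 bytes per 4 KiB
page) from the kernel zones.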

> -A memory block under ZONE_MOVABLE is considered to be able to be offlined
> -easily. But under some busy state, it may return -EBUSY. Even if a memory
> -block cannot be offlined due to -EBUSY, you can retry offlining it and may be
> -able to offline it (or not). (For example, a page is referred to by some kernel
> -internal call and released soon.)
> +- Long-term pinning of pages. Techniques that rely on long-term pinnings
> + (especially, RDMA and vfio/mdev) are fundamentally problematic with
> + ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside
> + on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they
> + have to be migrated off that zone while pinning. Pinning a page can fail
> + even if there is plenty of free memory in ZONE_MOVABLE.
>
> -Consideration:
> - Memory hotplug's design direction is to make the possibility of memory
> - offlining higher and to guarantee unplugging memory under any situation. But
> - it needs more work. Returning -EBUSY under some situation may be good because
> - the user can decide to retry more or not by himself. Currently, memory
> - offlining code does some amount of retry with 120 seconds timeout.
> + In addition, using ZONE_MOVABLE might make page pinning more expensive,
> + because of the page migration overhead.
>
> -Physical memory remove
> -======================
> +Boot Memory and ZONE_MOVABLE
> +----------------------------
>
> -Need more implementation yet....
> - - Notification completion of remove works by OS to firmware.
> - - Guard from remove if not yet.
> +Without further configuration, all boot memory will be managed by kernel zones
> +when booting up in most configurations. ZONE_MOVABLE is not used as default.
>
> +However, there is a mechanism to configure that behavior during boot via the
> +cmdline: ``kernelcore=`` and ``movablecore=``. See
> +Documentation/admin-guide/kernel-parameters.rst for details.
> +
> +Memory Offlining and ZONE_MOVABLE
> +---------------------------------
> +
> +Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
> +block might fail:
> +
> +- Memory blocks with memory holes; this applies to memory blocks present during
> + boot and can apply to memory blocks hotplugged via the XEN balloon and the
> + Hyper-V balloon.
> +
> +- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
> + offlining; this applies to memory blocks present during boot only.
> +
> +- Special memory blocks prevented by the system from getting offlined. Examples
> + include any memory available during boot on arm64 or memory blocks spanning
> + the crashkernel area on s390x; this usually applies to memory blocks present
> + during boot only.
> +
> +- Memory blocks overlapping with CMA areas cannot be offlined; this applies to
> + memory blocks present during boot only.
> +
> +- Concurrent activity that operates on the same physical memory area, such as
> + allocating gigantic pages, can result in temporary offlining failures.
> +
> +- Out of memory when dissolving huge pages, especially when freeing unused
> + vmemmap pages associated with each hugetlb page is enabled.
> +
> + Offlining code may be able to migrate huge page contents, but may not be able
> + to dissolve the source huge page because it fails allocating (unmovable) pages
> + for the vmemmap, because the system might not have free memory in the kernel
> + zones left.
> +
> + Users that depend on memory hotplug to succeed for movable zones should
> + carefully consider whether the memory savings gained from this feature are
> + worth the risk of possibly not being able to offline memory in certain
> + situations.
> +
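A concrete number might help here; if I remember the hugetlb vmemmap
optimization correctly, dissolving a single 2 MiB hugetlb page whose vmemmap
was freed means re-allocating 7 pages (28 KiB) of unmovable memory, and it is
exactly that allocation which can fail once the kernel zones are exhausted.
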
> +Further, when running into out of memory situations while migrating pages, or
> +when still encountering permanently unmovable pages within ZONE_MOVABLE
> +(-> BUG), memory offlining will keep retrying until it eventually succeeds.
>
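Unless it is already covered by one of the earlier sections the diff skips
here, maybe remind the reader how an offline request is actually issued and
verified, e.g.:

    % echo offline > /sys/devices/system/memory/memoryXXX/state
    % cat /sys/devices/system/memory/memoryXXX/state
    offline
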
> Locking Internals
> =================
> @@ -440,8 +594,8 @@ As the device is visible to user space before taking the device_lock(), this
> can result in a lock inversion.
>
> onlining/offlining of memory should be done via device_online()/
> -device_offline() - to make sure it is properly synchronized to actions
> -via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type)
> +device_offline() -- to make sure it is properly synchronized to actions
> +via sysfs -- while holding the device_hotplug_lock.
>
> When adding/removing/onlining/offlining memory or adding/removing
> heterogeneous/device memory, we should always hold the mem_hotplug_lock in
> @@ -452,15 +606,3 @@ In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read
> mode allows for a quite efficient get_online_mems/put_online_mems
> implementation, so code accessing memory can protect from that memory
> vanishing.
> -
> -
> -Future Work
> -===========
> -
> - - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
> - sysctl or new control file.
> - - showing memory block and physical device relationship.
> - - test and make it better memory offlining.
> - - support HugeTLB page migration and offlining.
> - - memmap removing at memory offline.
> - - physical remove memory.
> --
> 2.31.1
>

--
Sincerely yours,
Mike.