[RFC PATCH] mm, memory_hotplug: support movable_node for hotplugable nodes

From: Michal Hocko
Date: Thu Jun 01 2017 - 08:20:35 EST


From: Michal Hocko <mhocko@xxxxxxxx>

The movable_node kernel parameter makes the kernel put all the memory
of hotpluggable NUMA nodes into the movable zone, which allows more or
less reliable memory hotremove. At least this is the case for the NUMA
nodes present at boot (see find_zone_movable_pfns_for_nodes).

This is not the case for runtime memory hotplug, though.

echo online > /sys/devices/system/memory/memoryXYZ/state

will default to a kernel zone (usually ZONE_NORMAL) unless the
particular memblock is already in the movable zone range, which is
normally not the case when onlining memory from the udev rule context
for a freshly hotadded NUMA node. The only option currently is to have a
special udev rule to echo online_movable to all memblocks belonging to
such a node, which is rather clumsy. Not to mention that this is
inconsistent as well, because what ended up in the movable zone during
boot will end up in a kernel zone after hotremove & hotadd without
special care.
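
For illustration only, the per-memblock step such a udev rule has to
perform can be sketched in userspace C. This is a minimal sketch, not
part of the patch: set_memory_block_state is a hypothetical helper name,
and the caller is assumed to pass the path of the memblock's sysfs file
used in the echo command above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write an onlining mode ("online", "online_kernel"
 * or "online_movable") to a memory block's sysfs file.
 * Returns 0 on success, -1 on failure.
 */
static int set_memory_block_state(const char *state_path, const char *mode)
{
	FILE *f;
	int ret = 0;

	assert(state_path && mode);

	f = fopen(state_path, "w");
	if (!f)
		return -1;
	if (fputs(mode, f) == EOF)
		ret = -1;
	/* A request rejected by the kernel shows up when the buffer is flushed. */
	if (fclose(f) == EOF)
		ret = -1;
	return ret;
}
```

A rule would have to call this with "online_movable" for every memblock
of the hotadded node, which is the clumsiness described above.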

It would be nice to reuse memblock_is_hotpluggable, but the runtime
hotplug doesn't have that information available because the boot and
hotplug paths are not shared, and it would be really non-trivial to
make them use the same code path because the runtime hotplug doesn't
use the memblock allocator at all.

Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
movable_node is enabled and the range doesn't overlap with the existing
normal zone. This should provide a reasonable default onlining strategy.
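
The decision described above can be modelled outside the kernel. The
following is a simplified sketch under stated assumptions, not the kernel
code: struct zone_span and keep_picks_movable are hypothetical names, the
kernel_allowed flag stands in for the result of allow_online_pfn_range
with MMOP_ONLINE_KERNEL, and movable_node stands in for
movable_node_is_enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct zone: a half-open pfn span [start, end). */
struct zone_span {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* start_pfn == end_pfn means the zone is empty */
};

/* True if [start_pfn, start_pfn + nr_pages) overlaps the zone's span. */
static bool span_intersects(const struct zone_span *z,
			    unsigned long start_pfn, unsigned long nr_pages)
{
	if (z->start_pfn == z->end_pfn)
		return false;
	return start_pfn < z->end_pfn && start_pfn + nr_pages > z->start_pfn;
}

/*
 * Model of the MMOP_ONLINE_KEEP choice: returns true when the range
 * should go to the movable zone, false when it stays in a kernel zone.
 */
static bool keep_picks_movable(bool kernel_allowed, bool movable_node,
			       const struct zone_span *normal,
			       unsigned long start_pfn, unsigned long nr_pages)
{
	assert(nr_pages > 0);

	/* A kernel zone is not possible at all: movable is the only choice. */
	if (!kernel_allowed)
		return true;
	/* Without movable_node the old default (kernel zone) is kept. */
	if (!movable_node)
		return false;
	/* With movable_node, prefer movable unless the normal zone overlaps. */
	return !span_intersects(normal, start_pfn, nr_pages);
}
```

In this model, a freshly hotadded node has an empty normal zone, so with
movable_node set every range onlined with "online" lands in the movable
zone; a range that overlaps an already populated normal zone stays in a
kernel zone.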

Strictly speaking, the semantics are not identical to the boot time
initialization because find_zone_movable_pfns_for_nodes covers only the
hotpluggable range as described by the BIOS/FW. From my experience this
is usually a full node, though (except for Node0, which is special and
never goes away completely). If this turns out to be a problem in
real life we can tweak the code to store the hotplug flag in memblocks,
but let's keep this simple for now.

Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
---

Hi,
I am sending this as an RFC because, strictly speaking, this is a user
visible change of behavior. I believe it is a desirable change of
behavior, though, and an explicit opt-in (kernel parameter) is
required to see the change, so I do not expect any breakage. I would
still like to hear what other people think about this shift. I have
tested it on memory hotplug capable HW where the whole NUMA node can
be hotremoved/hotadded.

Does anybody see any problem with the proposed semantics?

mm/memory_hotplug.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b98fb0b3ae11..74d75583736c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -943,6 +943,19 @@ struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
 	return &pgdat->node_zones[ZONE_NORMAL];
 }
 
+static inline bool movable_pfn_range(int nid, struct zone *default_zone,
+		unsigned long start_pfn, unsigned long nr_pages)
+{
+	if (!allow_online_pfn_range(nid, start_pfn, nr_pages,
+				MMOP_ONLINE_KERNEL))
+		return true;
+
+	if (!movable_node_is_enabled())
+		return false;
+
+	return !zone_intersects(default_zone, start_pfn, nr_pages);
+}
+
 /*
  * Associates the given pfn range with the given node and the zone appropriate
  * for the given online type.
@@ -958,10 +971,10 @@ static struct zone * __meminit move_pfn_range(int online_type, int nid,
 		/*
 		 * MMOP_ONLINE_KEEP defaults to MMOP_ONLINE_KERNEL but use
 		 * movable zone if that is not possible (e.g. we are within
-		 * or past the existing movable zone)
+		 * or past the existing movable zone). movable_node overrides
+		 * this default and defaults to movable zone
 		 */
-		if (!allow_online_pfn_range(nid, start_pfn, nr_pages,
-					MMOP_ONLINE_KERNEL))
+		if (movable_pfn_range(nid, zone, start_pfn, nr_pages))
 			zone = movable_zone;
 	} else if (online_type == MMOP_ONLINE_MOVABLE) {
 		zone = &pgdat->node_zones[ZONE_MOVABLE];
--
2.11.0