Re: [RFC PATCH] rework memory hotplug onlining
From: Vitaly Kuznetsov
Date: Wed Mar 15 2017 - 06:48:53 EST
Michal Hocko <mhocko@xxxxxxxxxx> writes:
> Hi,
> this is a follow up for [1]. In short the current semantic of the memory
> hotplug is awkward and hard/impossible to use from udev to online
> memory as movable. The main problem is that only the last memblock or
> the one adjacent to the highest movable memblock can be onlined as movable:
> : Let's simulate memory hot online manually
> : # echo 0x100000000 > /sys/devices/system/memory/probe
> : # grep . /sys/devices/system/memory/memory32/valid_zones
> : Normal Movable
> :
> : which looks reasonably right? Both Normal and Movable zones are allowed
> :
> : # echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
> : # grep . /sys/devices/system/memory/memory3?/valid_zones
> : /sys/devices/system/memory/memory32/valid_zones:Normal
> : /sys/devices/system/memory/memory33/valid_zones:Normal Movable
> :
> : Huh, so our valid_zones have changed under our feet...
> :
> : # echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
> : # grep . /sys/devices/system/memory/memory3?/valid_zones
> : /sys/devices/system/memory/memory32/valid_zones:Normal
> : /sys/devices/system/memory/memory33/valid_zones:Normal
> : /sys/devices/system/memory/memory34/valid_zones:Normal Movable
> :
> : and again. So only the last memblock is considered movable. Let's try to
> : online them now.
> :
> : # echo online_movable > /sys/devices/system/memory/memory34/state
> : # grep . /sys/devices/system/memory/memory3?/valid_zones
> : /sys/devices/system/memory/memory32/valid_zones:Normal
> : /sys/devices/system/memory/memory33/valid_zones:Normal Movable
> : /sys/devices/system/memory/memory34/valid_zones:Movable Normal
>
> Now consider that the userspace gets the notification when the memblock
> is added. If the udev context tries to online it, it will a) race with
> new memblocks showing up, which leads to nondeterministic behavior, and
> b) see memblocks ordered by growing physical address while the only
> reliable way to online blocks as movable is from exactly the opposite
> direction. This is just plain wrong!
>
> It seems that all this was started by the semantic introduced by
> 9d99aaa31f59 ("[PATCH] x86_64: Support memory hotadd without sparsemem")
> quite some time ago. When movable onlining was introduced it
> just built on top of this. It seems that the requirement to have
> freshly probed memory associated with the zone normal is no longer
> necessary. HOTPLUG depends on CONFIG_SPARSEMEM these days.
>
> The following blob [2] simply removes all the zone specific operations
> from __add_pages (aka arch_add_memory) path. Instead we do page->zone
> association from move_pfn_range which is called from online_pages. The
> criterion for movable/normal zone association is really simple now. We
> just have to guarantee that zone Normal is always lower than zone
> Movable. It would be actually sufficient to guarantee they do not
> overlap and that is indeed trivial to implement now. I didn't do that
> yet for simplicity of this change though.
>
> I have lightly tested the patch and nothing really jumped at me. I
> assume there will be some rough edges but it should be sufficient to
> start the discussion at least. Please note the diffstat. We have added
> a lot of code to tweak on top of the previous semantic which is just
> sad. Instead of developing a robust solution the memory hotplug is full
> of tweaks to satisfy a particular usecase without longer term plans.
>
> Please note that this is just for x86 now but I will address other
> arches once there is an agreement this is the right approach.
>
> Thoughts, objections?
>
Speaking of the long-term approach (I'm not really familiar with the
history of the memory zones code, so please bear with me if my questions
are stupid):

Currently, when we online memory blocks we need to know where to put the
boundary between NORMAL and MOVABLE, and this is a very hard decision to
make, no matter if we do this from the kernel or from userspace. In
theory, we just want to avoid needlessly limiting future unplug, but we
don't really know how much memory we'll need for kernel allocations in
the future.

What actually stops us from having the following approach:
1) Everything is added to MOVABLE.
2) When we're out of memory for kernel allocations in NORMAL we 'harvest'
the first MOVABLE block and 'convert' it to NORMAL. It may happen that
there are no free pages in this block, but since it was MOVABLE we can
move all of its allocations somewhere else.
3) Freeing the whole 128MB memblock takes time, but we don't need to wait
until it finishes: we just need to satisfy the currently pending
allocation and we can continue moving everything else in the background.
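
To make steps 1-3 a bit more concrete, here is a toy userspace model in
Python. It is only a sketch, not kernel code: every name in it
(ToyMemory, alloc_kernel, migrate_out, background_step) is made up for
illustration, it tracks per-block page counts instead of real pages, and
it assumes 128MB blocks of 4KB pages and kernel allocations that fit in
one block.

```python
BLOCK_PAGES = 32768  # one 128MB memblock in 4KB pages (assumed sizes)


class ToyMemory:
    """Toy model: per-block page counters, no real page tracking."""

    def __init__(self, nr_blocks):
        # Step 1: every hotplugged block starts out as MOVABLE.
        self.zone = ["MOVABLE"] * nr_blocks
        self.movable = [0] * nr_blocks  # movable pages in each block
        self.kernel = [0] * nr_blocks   # unmovable pages in each block

    def free(self, i):
        return BLOCK_PAGES - self.movable[i] - self.kernel[i]

    def alloc_movable(self, pages):
        """Place movable pages, lowest MOVABLE block first."""
        targets = [i for i, z in enumerate(self.zone) if z == "MOVABLE"]
        if sum(self.free(i) for i in targets) < pages:
            return False
        for i in targets:
            take = min(pages, self.free(i))
            self.movable[i] += take
            pages -= take
        return True

    def migrate_out(self, src, pages):
        """Move up to `pages` movable pages from `src` into MOVABLE blocks."""
        moved = 0
        for dst in reversed(range(len(self.zone))):
            if dst == src or self.zone[dst] != "MOVABLE":
                continue
            take = min(pages - moved, self.movable[src], self.free(dst))
            self.movable[src] -= take
            self.movable[dst] += take
            moved += take
        return moved

    def alloc_kernel(self, pages):
        """Unmovable allocation of at most one block; harvest if needed."""
        assert 0 < pages <= BLOCK_PAGES
        for i, z in enumerate(self.zone):
            if z == "NORMAL" and self.free(i) >= pages:
                self.kernel[i] += pages
                return True
        # Step 2: NORMAL is exhausted, convert the first MOVABLE block.
        if "MOVABLE" not in self.zone:
            return False
        i = self.zone.index("MOVABLE")
        self.zone[i] = "NORMAL"
        # Step 3: migrate only what the pending allocation needs right now.
        need = max(0, pages - self.free(i))
        if self.migrate_out(i, need) < need:
            return False  # nowhere to move pages to; block stays converted
        self.kernel[i] += pages
        return True

    def background_step(self):
        """Drain leftover movable pages from harvested (NORMAL) blocks."""
        for i, z in enumerate(self.zone):
            if z == "NORMAL" and self.movable[i]:
                self.migrate_out(i, self.movable[i])


mem = ToyMemory(4)
mem.alloc_movable(BLOCK_PAGES + 500)  # block 0 is now completely full
mem.alloc_kernel(64)                  # harvests block 0, migrates 64 pages
print(mem.zone)        # ['NORMAL', 'MOVABLE', 'MOVABLE', 'MOVABLE']
mem.background_step()  # the rest of block 0 is moved out later
print(mem.movable[0])  # 0
```

Because the model always converts the first MOVABLE block, NORMAL stays
a contiguous range below MOVABLE, so the existing ordering rule is
preserved. What it does not model is the hard part: what happens when
the movable pages cannot actually be migrated in time.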

An alternative approach would be to have lists of memblocks which
constitute ZONE_NORMAL and ZONE_MOVABLE instead of the simple 'NORMAL
before MOVABLE' rule we have now, but I'm not sure this is a viable
approach with the current code base.
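
The difference between the two rules can be sketched in a few lines of
Python. This is my simplified reading of both rules, not what the kernel
implements: blocks are plain integers ordered by physical address, and
the function names are hypothetical.

```python
def allowed_zones_boundary(block, normal, movable):
    """'NORMAL before MOVABLE': all NORMAL blocks sit below all MOVABLE ones."""
    allowed = []
    if not movable or block < min(movable):
        allowed.append("Normal")
    if not normal or block > max(normal):
        allowed.append("Movable")
    return allowed


def allowed_zones_lists(block, normal, movable):
    """Per-zone block lists: membership is arbitrary, so no ordering check."""
    return ["Normal", "Movable"]


normal, movable = {32, 34}, {36}
print(allowed_zones_boundary(33, normal, movable))  # ['Normal']
print(allowed_zones_boundary(35, normal, movable))  # ['Normal', 'Movable']
print(allowed_zones_boundary(37, normal, movable))  # ['Movable']
print(allowed_zones_lists(33, normal, movable))     # ['Normal', 'Movable']
```

With the boundary rule the answer for a block depends on what has
already been onlined around it, which is what makes onlining from udev
order-dependent. With per-zone lists it would not.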
--
Vitaly