Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks

From: Igor Mammedov
Date: Thu Mar 02 2017 - 08:54:38 EST


On Mon 27-02-17 16:43:04, Michal Hocko wrote:
> On Mon 27-02-17 12:25:10, Heiko Carstens wrote:
> > On Mon, Feb 27, 2017 at 11:02:09AM +0100, Vitaly Kuznetsov wrote:
> > > A couple of other thoughts:
> > > 1) Having all newly added memory online ASAP is probably what people
> > > want for all virtual machines.
> >
> > This is not true for s390. On s390 we have "standby" memory that a guest
> > sees and potentially may use if it sets it online. Every guest that sets
> > memory offline contributes to the hypervisor's standby memory pool, while
> > onlining standby memory takes memory away from the standby pool.
> >
> > The use-case is that a system administrator in advance knows the maximum
> > size a guest will ever have and also defines how much memory should be used
> > at boot time. The difference is standby memory.
> >
> > Auto-onlining of standby memory is the last thing we want.
I don't know much about anything other than x86, so all comments
below are from that point of view;
architectures that don't need auto-online can keep the current default.

> > > Unfortunately, we have additional complexity with memory zones
> > > (ZONE_NORMAL, ZONE_MOVABLE) and in some cases manual intervention is
> > > required. Especially, when further unplug is expected.
> >
> > This also is a reason why auto-onlining doesn't seem be the best way.

When trying to support memory unplug on the guest side in RHEL7,
experience shows otherwise. A simplistic udev rule which onlines an
added block doesn't work if one wants to online it as movable.

In the current kernel, hotplugged blocks have to be onlined in reverse
order to end up movable, since the zone a block gets depends on the
zone of its adjacent blocks. That means a simple udev rule isn't usable,
since it gets events in first-to-last hotplugged block order. So now we
would have to write a daemon that would
- watch for all blocks of the hotplugged memory to appear (how would it know?)
- online them in the right order (the order might also differ depending
on the kernel version)
-- it becomes even more complicated in the NUMA case, when there are
multiple zones and the kernel would have to provide user-space
with information about zone maps
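
To make the ordering problem concrete, here is a rough sketch of the
onlining step such a daemon would need. It assumes the standard sysfs
layout (/sys/devices/system/memory/memoryN/state) and that writing
"online_movable" in descending block order is what the running kernel
wants; as noted above, that order is kernel-version dependent. The
function names are made up for illustration, and the sketch does not
address the harder part, which is knowing when all blocks have appeared.

```python
import os
import re
import tempfile


def offline_blocks(root):
    """Return (index, state_path) for every memoryN block that is offline."""
    blocks = []
    for name in os.listdir(root):
        m = re.fullmatch(r"memory(\d+)", name)
        if not m:
            continue
        state_path = os.path.join(root, name, "state")
        try:
            with open(state_path) as f:
                if f.read().strip() == "offline":
                    blocks.append((int(m.group(1)), state_path))
        except OSError:
            # Block vanished or is unreadable; skip it.
            continue
    return blocks


def online_in_reverse(root="/sys/devices/system/memory",
                      state="online_movable"):
    """Online offline blocks from the highest index down (assumed order)."""
    done = []
    for index, state_path in sorted(offline_blocks(root), reverse=True):
        with open(state_path, "w") as f:
            f.write(state)
        done.append(index)
    return done


if __name__ == "__main__":
    # Demo against a fake tree so nothing on the real system is touched.
    fake = tempfile.mkdtemp()
    for i in (32, 33, 34):
        os.makedirs(os.path.join(fake, "memory%d" % i))
        with open(os.path.join(fake, "memory%d" % i, "state"), "w") as f:
            f.write("offline\n")
    print(online_in_reverse(fake))
```

Even this toy version has to hard-code an ordering assumption, which is
exactly the kind of kernel detail user-space shouldn't need to know.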

In short, current experience shows that the userspace approach
- doesn't solve the issues that Vitaly has been fixing (i.e. onlining
fast and/or under memory pressure, when udev or something else
might be killed)
- doesn't reduce overall system complexity; it only gets worse,
as the user-space handler needs to know a lot about kernel internals
and implementation details/kernel versions to work properly

It might not be easy, but doing onlining in the kernel, on the other hand, is:
- faster
- more reliable (can't be killed under memory pressure)
- the kernel has access to all the info needed for onlining and knows how
its internals work, so it can implement auto-online correctly
- since there is no need to maintain an ABI for user-space
(zone layout/ordering/maybe something else), the kernel is
free to change its internal implementation without breaking userspace
(currently hotplug+unplug doesn't work reliably and we might
need something more flexible than zones)
That's the direction of the research in progress, i.e. making the kernel
implementation better instead of putting the responsibility on
user-space to deal with kernel shortcomings.

> Can you imagine any situation when somebody actually might want to have
> this knob enabled? From what I understand it doesn't seem to be the
> case.
For x86:
* this config option is enabled by default in recent Fedora,
* RHEL6 has shipped similar downstream patches that do the same thing for years,
* RHEL7 has a udev rule (because there wasn't a kernel-side solution at fork time)
that auto-onlines memory unconditionally; Vitaly might backport the
kernel-side solution later when he has time.
It's not the Linux kernel, but an auto-online policy is also used by Windows,
both on bare metal and in guest configurations.

That somewhat shows that the current upstream defaults on x86
might not be what end-users wish for.

When auto_online_blocks was introduced, Vitaly was
conservative and left the current upstream defaults where they were,
lest it break someone else's setup, while allowing downstreams
to set their own auto-online policy; eventually we might switch it
upstream too.
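
For reference, a small sketch of how the policy can be inspected or
changed at runtime. It assumes the knob is exposed as
/sys/devices/system/memory/auto_online_blocks and accepts "online" and
"offline"; the helper names are made up, and writing needs root.

```python
import os


def read_auto_online_policy(root="/sys/devices/system/memory"):
    """Return the current policy string, or None if the knob is absent."""
    path = os.path.join(root, "auto_online_blocks")
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None


def set_auto_online_policy(value, root="/sys/devices/system/memory"):
    """Write a new policy (assumed values: "online" or "offline")."""
    path = os.path.join(root, "auto_online_blocks")
    with open(path, "w") as f:
        f.write(value)
```

A downstream that wants auto-onlining without touching the kernel
config could call set_auto_online_policy("online") early at boot.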