Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

From: Michal Hocko
Date: Tue Sep 11 2018 - 05:16:14 EST


On Mon 10-09-18 15:26:55, Pavel Tatashin wrote:
>
>
> On 9/10/18 10:41 AM, Michal Hocko wrote:
> > On Mon 10-09-18 14:32:16, Pavel Tatashin wrote:
> >> On Mon, Sep 10, 2018 at 10:19 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >>>
> >>> On Mon 10-09-18 14:11:45, Pavel Tatashin wrote:
> >>>> Hi Michal,
> >>>>
> >>>> It is tricky, but it probably can be done. Either change
> >>>> memmap_init_zone() or its caller to also cover the ends and starts of
> >>>> unaligned sections, initializing and reserving those pages.
> >>>>
> >>>> The same thing would also need to be done in deferred_init_memmap() to
> >>>> cover the deferred init case.
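
For illustration, here is a rough standalone sketch of that idea, i.e.
walking the memmap up to the section boundary and reserving the pages
past the real end. The constant, the enum and init_unaligned_tail() are
stand-ins made up for this sketch, not the kernel's definitions:

#include <stdint.h>

/* Illustrative constant: 128M sections with 4K pages; not the kernel's
 * PAGES_PER_SECTION definition. */
#define PAGES_PER_SECTION (1ULL << 15)

enum fake_page_state { PAGE_FREE, PAGE_RESERVED };

/* Initialize the memmap up to the section boundary even when the usable
 * range ends before it, and mark the trailing pages reserved so they
 * are never handed out. "state" stands in for the struct page array. */
static void init_unaligned_tail(uint64_t start_pfn, uint64_t end_pfn,
                                enum fake_page_state *state)
{
    uint64_t section_end = (end_pfn + PAGES_PER_SECTION - 1) &
                           ~(PAGES_PER_SECTION - 1);
    uint64_t pfn;

    for (pfn = start_pfn; pfn < section_end; pfn++)
        state[pfn - start_pfn] = (pfn < end_pfn) ? PAGE_FREE
                                                 : PAGE_RESERVED;
}
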
> >>>
> >>> Well, I am not sure, TBH. I have to think about that much more. Maybe it
> >>> would be much simpler to make sure that we never add incomplete
> >>> memblocks and simply refuse them during discovery. At least for now.
> >>
> >> On x86 memblocks can be up to 2G on machines with over 64G of RAM.
> >
> > Sorry, I meant pageblock_nr_pages rather than memblocks.
>
> OK. This sounds reasonable, but to be honest I am not sure how to
> achieve this yet; I need to think more about it. In theory, with the
> sparse memory model it makes sense to enforce memory alignment to the
> section size, which sounds a lot safer.

Memory hotplug is sparsemem only. Fortunately, you do not have to think
about other memory models.
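
For a sense of scale behind the pageblock vs. memblock distinction above
(the numbers assume x86_64 with 4K base pages, 2M pageblocks, 128M
sections and 2G memory block devices, which is typical but not
universal):

#include <stdio.h>

int main(void)
{
    unsigned long long page      = 4ULL << 10;     /* 4K base page           */
    unsigned long long pageblock = page << 9;      /* 2M, order-9 pageblock  */
    unsigned long long section   = 128ULL << 20;   /* 128M sparsemem section */
    unsigned long long memblock  = 2ULL << 30;     /* up to 2G memory block  */

    printf("pages per pageblock:       %llu\n", pageblock / page);
    printf("pageblocks per section:    %llu\n", section / pageblock);
    printf("sections per memory block: %llu\n", memblock / section);
    return 0;
}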

> >> Also, the memory size is far too easy to change via qemu arguments when
> >> a VM starts. If we simply disable unaligned trailing memblocks, I am
> >> sure we would get tons of noise about missing memory.
> >>
> >> I think adding check_hotplug_memory_range() would work to fix the
> >> immediate problem, but we do need to figure out a better solution.
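
As a minimal sketch of the kind of check being discussed here (the
function name and the hard-coded section size are placeholders, not the
actual check_hotplug_memory_range() implementation):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative constant: 128M sparsemem sections as on x86_64. */
#define SECTION_SIZE_BYTES (128ULL << 20)

/* Reject a hotplug request unless both its start and its size fall on
 * section boundaries, so no section ends up only partially populated. */
static bool hotplug_range_is_aligned(uint64_t start, uint64_t size)
{
    if (size == 0)
        return false;
    return (start % SECTION_SIZE_BYTES) == 0 &&
           (size % SECTION_SIZE_BYTES) == 0;
}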
> >>
> >> The memblock design is based on the archaic assumption that hotplug
> >> units are physical DIMMs. VMs and hypervisors changed all of that, and
> >> we can have much finer-grained hotplug requests on machines with huge
> >> DIMMs. Yet we do not want to pollute sysfs with millions of tiny memory
> >> devices. I am not sure what a proper long-term solution for this
> >> problem should be, but I can see that the Linux hotplug/hotremove
> >> subsystems must be redesigned based on the new requirements.
> >
> > Not an easy task, though. Anyway, the sparse memory model is heavily
> > based on memory sections, so it makes some sense to have hotplug be
> > section based as well. Memblocks as a higher logical unit on top of that
> > are kind of a hack. The userspace API has never been properly thought
> > through, I am afraid.
>
> I agree memoryblock is a hack; it fails to do both things it was
> designed to do:
>
> 1. On bare metal you cannot free a physical DIMM of memory at
> memoryblock granularity, because memory devices do not correspond to
> physical DIMMs. Thus, if for some reason a particular DIMM must be
> removed/replaced, memoryblock does not help us.

agreed

> 2. On machines with hypervisors it fails to provide adequate
> granularity to add/remove memory.
>
> We should define a new user interface where memory can be added/removed
> at a finer granularity, the sparse section size, but without a memory
> device for each section. We should also provide optional access to a
> legacy interface where memory devices are exported, but each is of
> section size.
>
> So, when the legacy interface is enabled, the current way would work:
>
> echo offline > /sys/devices/system/memory/memoryXXX/state
>
> And the new interface would allow us to do something like this:
>
> echo offline 256M > /sys/devices/system/node/nodeXXX/memory
>
> With an optional start address for the memory to offline:
> echo offline [start_pa] size > /sys/devices/system/node/nodeXXX/memory
> start_pa and size must be section-size aligned (128M).
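
Purely as an illustration of how such a request string could be parsed
(the struct, parse_offline_req() and the hard-coded 128M alignment are
made up for this sketch; this is not an existing kernel interface):

#include <inttypes.h>
#include <stdio.h>

struct offline_req {
    uint64_t start_pa;   /* 0 means "let the kernel pick a range" */
    uint64_t size;
};

static int parse_offline_req(const char *buf, struct offline_req *req)
{
    uint64_t start = 0, size_mb = 0;

    /* "offline <start_pa> <size>M" */
    if (sscanf(buf, "offline %" SCNx64 " %" SCNu64 "M", &start, &size_mb) == 2) {
        req->start_pa = start;
    /* "offline <size>M" -- no start address given */
    } else if (sscanf(buf, "offline %" SCNu64 "M", &size_mb) == 1) {
        req->start_pa = 0;
    } else {
        return -1;
    }
    req->size = size_mb << 20;

    /* Enforce the 128M section-size alignment mentioned above. */
    if (req->size == 0 || (req->size & ((128ULL << 20) - 1)) ||
        (req->start_pa & ((128ULL << 20) - 1)))
        return -1;
    return 0;
}

With that, "offline 256M" would map to start_pa == 0 and size == 256M,
and any request not aligned to 128M would be rejected.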

I am not sure what the expected semantics of the version without
start_pa would be.

> How to solve the current memory hotplug interface limitations would
> probably be a good topic for the next MM Summit.

Yes, sounds good to me. In any case, let's not pollute this email thread
with that discussion now.
--
Michal Hocko
SUSE Labs