Re: [PATCH 1/2] mm: add private lock to serialize memory hotplug operations
From: Rafael J. Wysocki
Date: Thu Mar 09 2017 - 17:40:26 EST
On Thursday, March 09, 2017 02:33:43 PM Dan Williams wrote:
> On Thu, Mar 9, 2017 at 2:15 PM, Rafael J. Wysocki <rjw@xxxxxxxxxxxxx> wrote:
> > On Thursday, March 09, 2017 10:10:31 AM Dan Williams wrote:
> >> On Thu, Mar 9, 2017 at 5:39 AM, Rafael J. Wysocki <rjw@xxxxxxxxxxxxx> wrote:
> >> > On Thursday, March 09, 2017 02:06:15 PM Heiko Carstens wrote:
> >> >> Commit bfc8c90139eb ("mem-hotplug: implement get/put_online_mems")
> >> >> introduced new functions get/put_online_mems() and
> >> >> mem_hotplug_begin/end() in order to allow similar semantics for memory
> >> >> hotplug like for cpu hotplug.
> >> >>
> >> >> The corresponding functions for cpu hotplug are get/put_online_cpus()
> >> >> and cpu_hotplug_begin/done().
> >> >>
> >> >> That commit, however, did not introduce functions that would serialize
> >> >> memory hotplug operations the way cpu_maps_update_begin/done() does
> >> >> for cpu hotplug.
> >> >>
> >> >> This basically leaves mem_hotplug.active_writer unprotected and allows
> >> >> concurrent writers to modify it, which may lead to problems as
> >> >> outlined by commit f931ab479dd2 ("mm: fix devm_memremap_pages crash,
> >> >> use mem_hotplug_{begin, done}").
> >> >>
> >> >> That commit was extended again with commit b5d24fda9c3d ("mm,
> >> >> devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
> >> >> done}") which serializes memory hotplug operations for some call
> >> >> sites by using the device_hotplug lock.
> >> >>
> >> >> In addition with commit 3fc21924100b ("mm: validate device_hotplug is
> >> >> held for memory hotplug") a sanity check was added to
> >> >> mem_hotplug_begin() to verify that the device_hotplug lock is held.
> >> >
> >> > Admittedly, I haven't looked at all of the code paths involved in detail yet,
> >> > but there's one concern regarding lock/unlock_device_hotplug().
> >> >
> >> > The actual main purpose of it is to ensure safe removal of devices in cases
> >> > when they cannot be removed separately, like when a whole CPU package
> >> > (including possibly an entire NUMA node with memory and all) is removed.
> >> >
> >> > One of the code paths doing that is acpi_scan_hot_remove() which first
> >> > tries to offline devices slated for removal and then finally removes them.
> >> >
> >> > The reason why this needs to be done in two stages is because the offlining
> >> > can fail, in which case we will fail the entire operation, while the final
> >> > removal step is, well, final (meaning that the devices are gone after it no
> >> > matter what).
> >> >
> >> > This is done under device_hotplug_lock, so that the devices that were taken
> >> > offline in stage 1 cannot be brought back online before stage 2 is carried
> >> > out entirely, which surely would be bad if it happened.
> >> >
> >> > Now, I'm not sure if removing lock/unlock_device_hotplug() from the code in
> >> > question actually affects this mechanism, but in case it does, this is one
> >> > thing to double check before going ahead with this patch.
> >> >
> >>
> >> I *think* we're ok in this case because unplugging the CPU package
> >> that contains a persistent memory device will trigger
> >> devm_memremap_pages() to call arch_remove_memory(). Removing a pmem
> >> device can't fail. It may be held off while pages are pinned for DMA
> >> memory, but it will eventually complete.
> >
> > What about the offlining, though? Is it guaranteed that no memory from those
> > ranges will go back online after the acpi_scan_try_to_offline() call in
> > acpi_scan_hot_remove()?
>
> The memory described by devm_memremap_pages() is never "onlined" to
> the core mm. We're only using arch_add_memory() to get a linear
> mapping and page structures. The rest of memory hotplug is skipped,
> and this ZONE_DEVICE memory is otherwise hidden from the core mm.

OK, that should be fine then.

> Are ACPI devices disabled by this point? For example, if we have
> disabled the nfit bus device (_HID ACPI0012) then the associated child
> pmem device(s) will be gone and not coming back.

We call acpi_bus_trim() on the root of the subtree in question before calling
acpi_evaluate_ej0(), so the driver's ->remove() should be called before that,
but it can't leave any delayed works behind.

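
To make the ordering discussed in this thread concrete, here is a minimal
user-space sketch of the two-stage sequence: try to offline everything (this
can fail), then do the final removal, all under one lock. It is a model only,
not kernel code. Every name in it (model_dev, model_hotplug_lock,
model_try_to_offline, model_trim, model_hot_remove) is hypothetical, and the
rollback on failure plus the "all devices start online" condition are
assumptions of the model rather than statements about acpi_scan_hot_remove().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one device in the subtree being removed. */
struct model_dev {
	bool can_offline;	/* whether stage 1 may succeed for this device */
	bool online;
	bool present;
};

/* Plays the role of the lock held across both stages. */
static pthread_mutex_t model_hotplug_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Stage 1: try to offline every device. On the first failure, bring the
 * devices offlined so far back online and fail the whole operation.
 * Assumption of this model: every device is online on entry.
 */
static int model_try_to_offline(struct model_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!devs[i].can_offline) {
			while (i-- > 0)
				devs[i].online = true;
			return -1;
		}
		devs[i].online = false;
	}
	return 0;
}

/* Stage 2: final removal. Modeled as a step that cannot fail. */
static void model_trim(struct model_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		/* Nothing may have come back online since stage 1. */
		assert(!devs[i].online);
		devs[i].present = false;
	}
}

/* Both stages run under the same lock, so no onlining can slip in between. */
int model_hot_remove(struct model_dev *devs, size_t n)
{
	int ret;

	pthread_mutex_lock(&model_hotplug_lock);
	ret = model_try_to_offline(devs, n);
	if (ret == 0)
		model_trim(devs, n);
	pthread_mutex_unlock(&model_hotplug_lock);

	return ret;
}
```

In this model, anything that wants to bring a device back online would have to
take model_hotplug_lock first, which is what keeps the window between the two
stages closed.
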
> Now, that said, the ACPI0012 bus device is global for the entire
> system. So we'd need more plumbing to target the pmem on a given
> socket without touching the others.

Well, it's all a bit academic at this point AFAICS.

Thanks,
Rafael