Re: [PATCH 1/2] mm: add private lock to serialize memory hotplug operations

From: Rafael J. Wysocki
Date: Thu Mar 09 2017 - 08:44:29 EST


On Thursday, March 09, 2017 02:06:15 PM Heiko Carstens wrote:
> Commit bfc8c90139eb ("mem-hotplug: implement get/put_online_mems")
> introduced new functions get/put_online_mems() and
> mem_hotplug_begin/end() in order to allow similar semantics for memory
> hotplug like for cpu hotplug.
>
> The corresponding functions for cpu hotplug are get/put_online_cpus()
> and cpu_hotplug_begin/done().
>
> The commit, however, failed to introduce functions that would serialize
> memory hotplug operations the way cpu hotplug operations are serialized
> with cpu_maps_update_begin/done().
>
> This basically leaves mem_hotplug.active_writer unprotected and allows
> concurrent writers to modify it, which may lead to problems as
> outlined by commit f931ab479dd2 ("mm: fix devm_memremap_pages crash,
> use mem_hotplug_{begin, done}").
>
> That commit was extended again with commit b5d24fda9c3d ("mm,
> devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
> done}") which serializes memory hotplug operations for some call
> sites by using the device_hotplug lock.
>
> In addition with commit 3fc21924100b ("mm: validate device_hotplug is
> held for memory hotplug") a sanity check was added to
> mem_hotplug_begin() to verify that the device_hotplug lock is held.

Admittedly, I haven't looked at all of the code paths involved in detail yet,
but there's one concern regarding lock/unlock_device_hotplug().

The main purpose of that lock is to ensure safe removal of devices in cases
when they cannot be removed separately, like when a whole CPU package
(including possibly an entire NUMA node with memory and all) is removed.

One of the code paths doing that is acpi_scan_hot_remove() which first
tries to offline devices slated for removal and then finally removes them.

This needs to be done in two stages because the offlining can fail, in
which case we will fail the entire operation, while the final removal step
is, well, final (meaning that the devices are gone after it no matter what).

This is done under device_hotplug_lock, so that the devices that were taken
offline in stage 1 cannot be brought back online before stage 2 is carried
out entirely, which surely would be bad if it happened.
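
To make the two-stage scheme above concrete, here is a minimal userspace
sketch. It is not the kernel code: the pthread mutex only stands in for
device_hotplug_lock, and struct fake_dev, fake_offline(), fake_online() and
fake_hot_remove() are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for device_hotplug_lock (assumption, not kernel code). */
static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical device model, for illustration only. */
struct fake_dev {
	bool online;
	bool present;
	bool busy;		/* when set, offlining fails */
	bool offlined_here;	/* taken offline by the current removal */
};

/* Offlining may fail; the caller holds hotplug_lock. */
static int fake_offline(struct fake_dev *d)
{
	if (d->busy)
		return -1;
	d->online = false;
	d->offlined_here = true;
	return 0;
}

/*
 * Onlining takes the same lock, so it cannot slip in between the two
 * stages of fake_hot_remove().
 */
static int fake_online(struct fake_dev *d)
{
	int ret = -1;

	pthread_mutex_lock(&hotplug_lock);
	if (d->present) {
		d->online = true;
		ret = 0;
	}
	pthread_mutex_unlock(&hotplug_lock);
	return ret;
}

static int fake_hot_remove(struct fake_dev *devs, size_t n)
{
	size_t i;
	int ret = 0;

	pthread_mutex_lock(&hotplug_lock);

	/* Stage 1: try to offline everything; this can still fail. */
	for (i = 0; i < n; i++) {
		if (devs[i].online && fake_offline(&devs[i])) {
			/* Roll back the devices offlined so far. */
			while (i-- > 0) {
				if (devs[i].offlined_here) {
					devs[i].online = true;
					devs[i].offlined_here = false;
				}
			}
			ret = -1;
			goto out;
		}
	}

	/* Stage 2: final removal; no failure path from here on. */
	for (i = 0; i < n; i++) {
		assert(!devs[i].online);
		devs[i].present = false;
	}
out:
	pthread_mutex_unlock(&hotplug_lock);
	return ret;
}
```

In this sketch, a concurrent fake_online() blocks until fake_hot_remove()
has either rolled back stage 1 or completed stage 2. If the two stages
dropped the lock in between, a device could come back online before being
removed, which is the situation described above.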

Now, I'm not sure whether removing lock/unlock_device_hotplug() from the code
in question actually affects this mechanism, but in case it does, it is one
thing to double check before going ahead with this patch.

Thanks,
Rafael