Re: [v5 0/3] "Hotremove" persistent memory

From: David Hildenbrand
Date: Mon May 20 2019 - 04:01:09 EST


On 17.05.19 16:09, Pavel Tatashin wrote:
>>
>> I would think that ACPI hotplug would have a similar problem, but it does this:
>>
>> acpi_unbind_memory_blocks(info);
>> __remove_memory(nid, info->start_addr, info->length);
>
> ACPI does have exactly the same problem, so this is not a bug in this
> series. I will submit a new version of my series with the review comments
> addressed, but without a fix for this issue.
>
> I was able to reproduce this issue on the current mainline kernel.
> Also, I have been thinking more about how to fix it, and there is no easy
> fix without a major hotplug redesign. Basically, we have to remove the
> sysfs memory entries either before or after memory is hotplugged/hotremoved,
> but we also have to guarantee that the hotplug/hotremove will succeed, or
> else reinstate the sysfs entries.
>
> Qemu script:
>
> qemu-system-x86_64 \
> -enable-kvm \
> -cpu host \
> -parallel none \
> -echr 1 \
> -serial none \
> -chardev stdio,id=console,signal=off,mux=on \
> -serial chardev:console \
> -mon chardev=console \
> -vga none \
> -display none \
> -kernel pmem/native/arch/x86/boot/bzImage \
> -m 8G,slots=1,maxmem=16G \
> -smp 8 \
> -fsdev local,id=virtfs1,path=/,security_model=none \
> -device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs \
> -append 'earlyprintk=serial,ttyS0,115200 console=ttyS0
> TERM=xterm ip=dhcp loglevel=7'
>
> Config is attached.
>
> Steps to reproduce:
> #
> # QEMU 4.0.0 monitor - type 'help' for more information
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> (qemu)
>
> # echo online_movable > /sys/devices/system/memory/memory79/state
> [ 23.029552] Built 1 zonelists, mobility grouping on. Total pages: 2045370
> [ 23.032591] Policy zone: Normal
> # (qemu) device_del dimm1
> (qemu) [ 32.013950] Offlined Pages 32768
> [ 32.014307] Built 1 zonelists, mobility grouping on. Total pages: 2031022
> [ 32.014843] Policy zone: Normal
> [ 32.015733]
> [ 32.015881] ======================================================
> [ 32.016390] WARNING: possible circular locking dependency detected
> [ 32.016881] 5.1.0_pt_pmem #38 Not tainted
> [ 32.017202] ------------------------------------------------------
> [ 32.017680] kworker/u16:4/380 is trying to acquire lock:
> [ 32.018096] 00000000675cc7e1 (kn->count#18){++++}, at:
> kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.018745]
> [ 32.018745] but task is already holding lock:
> [ 32.019201] 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
> unregister_memory_section+0x1d/0xa0
> [ 32.019859]
> [ 32.019859] which lock already depends on the new lock.
> [ 32.019859]
> [ 32.020499]
> [ 32.020499] the existing dependency chain (in reverse order) is:
> [ 32.021080]
> [ 32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
> [ 32.021522] __mutex_lock+0x8b/0x900
> [ 32.021843] hotplug_memory_register+0x26/0xa0
> [ 32.022231] __add_pages+0xe7/0x160
> [ 32.022545] add_pages+0xd/0x60
> [ 32.022835] add_memory_resource+0xc3/0x1d0
> [ 32.023207] __add_memory+0x57/0x80
> [ 32.023530] acpi_memory_device_add+0x13a/0x2d0
> [ 32.023928] acpi_bus_attach+0xf1/0x200
> [ 32.024272] acpi_bus_scan+0x3e/0x90
> [ 32.024597] acpi_device_hotplug+0x284/0x3e0
> [ 32.024972] acpi_hotplug_work_fn+0x15/0x20
> [ 32.025342] process_one_work+0x2a0/0x650
> [ 32.025755] worker_thread+0x34/0x3d0
> [ 32.026077] kthread+0x118/0x130
> [ 32.026442] ret_from_fork+0x3a/0x50
> [ 32.026766]
> [ 32.026766] -> #3 (mem_hotplug_lock.rw_sem){++++}:
> [ 32.027261] get_online_mems+0x39/0x80
> [ 32.027600] kmem_cache_create_usercopy+0x29/0x2c0
> [ 32.028019] kmem_cache_create+0xd/0x10
> [ 32.028367] ptlock_cache_init+0x1b/0x23
> [ 32.028724] start_kernel+0x1d2/0x4b8
> [ 32.029060] secondary_startup_64+0xa4/0xb0
> [ 32.029447]
> [ 32.029447] -> #2 (cpu_hotplug_lock.rw_sem){++++}:
> [ 32.030007] cpus_read_lock+0x39/0x80
> [ 32.030360] __offline_pages+0x32/0x790
> [ 32.030709] memory_subsys_offline+0x3a/0x60
> [ 32.031089] device_offline+0x7e/0xb0
> [ 32.031425] acpi_bus_offline+0xd8/0x140
> [ 32.031821] acpi_device_hotplug+0x1b2/0x3e0
> [ 32.032202] acpi_hotplug_work_fn+0x15/0x20
> [ 32.032576] process_one_work+0x2a0/0x650
> [ 32.032942] worker_thread+0x34/0x3d0
> [ 32.033283] kthread+0x118/0x130
> [ 32.033588] ret_from_fork+0x3a/0x50
> [ 32.033919]
> [ 32.033919] -> #1 (&device->physical_node_lock){+.+.}:
> [ 32.034450] __mutex_lock+0x8b/0x900
> [ 32.034784] acpi_get_first_physical_node+0x16/0x60
> [ 32.035217] acpi_companion_match+0x3b/0x60
> [ 32.035594] acpi_device_uevent_modalias+0x9/0x20
> [ 32.036012] platform_uevent+0xd/0x40
> [ 32.036352] dev_uevent+0x85/0x1c0
> [ 32.036674] kobject_uevent_env+0x1e2/0x640
> [ 32.037044] kobject_synth_uevent+0x2b7/0x324
> [ 32.037428] uevent_store+0x17/0x30
> [ 32.037752] kernfs_fop_write+0xeb/0x1a0
> [ 32.038112] vfs_write+0xb2/0x1b0
> [ 32.038417] ksys_write+0x57/0xd0
> [ 32.038721] do_syscall_64+0x4b/0x1a0
> [ 32.039053] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 32.039491]
> [ 32.039491] -> #0 (kn->count#18){++++}:
> [ 32.039913] lock_acquire+0xaa/0x180
> [ 32.040242] __kernfs_remove+0x244/0x2d0
> [ 32.040593] kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.040991] device_del+0x14a/0x370
> [ 32.041309] device_unregister+0x9/0x20
> [ 32.041653] unregister_memory_section+0x69/0xa0
> [ 32.042059] __remove_pages+0x112/0x460
> [ 32.042402] arch_remove_memory+0x6f/0xa0
> [ 32.042758] __remove_memory+0xab/0x130
> [ 32.043103] acpi_memory_device_remove+0x67/0xe0
> [ 32.043537] acpi_bus_trim+0x50/0x90
> [ 32.043889] acpi_device_hotplug+0x2fa/0x3e0
> [ 32.044300] acpi_hotplug_work_fn+0x15/0x20
> [ 32.044686] process_one_work+0x2a0/0x650
> [ 32.045044] worker_thread+0x34/0x3d0
> [ 32.045381] kthread+0x118/0x130
> [ 32.045679] ret_from_fork+0x3a/0x50
> [ 32.046005]
> [ 32.046005] other info that might help us debug this:
> [ 32.046005]
> [ 32.046636] Chain exists of:
> [ 32.046636] kn->count#18 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
> [ 32.046636]
> [ 32.047514] Possible unsafe locking scenario:
> [ 32.047514]
> [ 32.047976] CPU0 CPU1
> [ 32.048337] ---- ----
> [ 32.048697] lock(mem_sysfs_mutex);
> [ 32.048983] lock(mem_hotplug_lock.rw_sem);
> [ 32.049519] lock(mem_sysfs_mutex);
> [ 32.050004] lock(kn->count#18);
> [ 32.050270]
> [ 32.050270] *** DEADLOCK ***
> [ 32.050270]
> [ 32.050736] 7 locks held by kworker/u16:4/380:
> [ 32.051087] #0: 00000000a22fe78e
> ((wq_completion)kacpi_hotplug){+.+.}, at: process_one_work+0x21e/0x650
> [ 32.051830] #1: 00000000944f2dca
> ((work_completion)(&hpw->work)){+.+.}, at:
> process_one_work+0x21e/0x650
> [ 32.052577] #2: 0000000024bbe147 (device_hotplug_lock){+.+.}, at:
> acpi_device_hotplug+0x2e/0x3e0
> [ 32.053271] #3: 000000005cb50027 (acpi_scan_lock){+.+.}, at:
> acpi_device_hotplug+0x3c/0x3e0
> [ 32.053916] #4: 00000000b8d06992 (cpu_hotplug_lock.rw_sem){++++},
> at: __remove_memory+0x3b/0x130
> [ 32.054602] #5: 00000000897f0ef4 (mem_hotplug_lock.rw_sem){++++},
> at: percpu_down_write+0x1d/0x110
> [ 32.055315] #6: 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
> unregister_memory_section+0x1d/0xa0
> [ 32.056016]
> [ 32.056016] stack backtrace:
> [ 32.056355] CPU: 4 PID: 380 Comm: kworker/u16:4 Not tainted 5.1.0_pt_pmem #38
> [ 32.056923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.12.0-20181126_142135-anatol 04/01/2014
> [ 32.057720] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> [ 32.058144] Call Trace:
> [ 32.058344] dump_stack+0x67/0x90
> [ 32.058604] print_circular_bug.cold.60+0x15c/0x195
> [ 32.058989] __lock_acquire+0x17de/0x1d30
> [ 32.059308] ? find_held_lock+0x2d/0x90
> [ 32.059611] ? __kernfs_remove+0x199/0x2d0
> [ 32.059937] lock_acquire+0xaa/0x180
> [ 32.060223] ? kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.060596] __kernfs_remove+0x244/0x2d0
> [ 32.060908] ? kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.061283] ? kernfs_name_hash+0xd/0x80
> [ 32.061596] ? kernfs_find_ns+0x68/0xf0
> [ 32.061907] kernfs_remove_by_name_ns+0x3b/0x80
> [ 32.062266] device_del+0x14a/0x370
> [ 32.062548] ? unregister_mem_sect_under_nodes+0x4f/0xc0
> [ 32.062973] device_unregister+0x9/0x20
> [ 32.063285] unregister_memory_section+0x69/0xa0
> [ 32.063651] __remove_pages+0x112/0x460
> [ 32.063949] arch_remove_memory+0x6f/0xa0
> [ 32.064271] __remove_memory+0xab/0x130
> [ 32.064579] ? walk_memory_range+0xa1/0xe0
> [ 32.064907] acpi_memory_device_remove+0x67/0xe0
> [ 32.065274] acpi_bus_trim+0x50/0x90
> [ 32.065560] acpi_device_hotplug+0x2fa/0x3e0
> [ 32.065900] acpi_hotplug_work_fn+0x15/0x20
> [ 32.066249] process_one_work+0x2a0/0x650
> [ 32.066591] worker_thread+0x34/0x3d0
> [ 32.066925] ? process_one_work+0x650/0x650
> [ 32.067275] kthread+0x118/0x130
> [ 32.067542] ? kthread_create_on_node+0x60/0x60
> [ 32.067909] ret_from_fork+0x3a/0x50
>
>>
>> I wonder if that ordering prevents going too deep into the
>> device_unregister() call stack that you highlighted below.
>>
>>
>>>
>>> Here is the problem:
>>>
>>> When we offline pages we have the following call stack:
>>>
>>> # echo offline > /sys/devices/system/memory/memory8/state
>>> ksys_write
>>> vfs_write
>>> __vfs_write
>>> kernfs_fop_write
>>> kernfs_get_active
>>> lock_acquire kn->count#122 (lock for
>>> "memory8/state" kn)
>>> sysfs_kf_write
>>> dev_attr_store
>>> state_store
>>> device_offline
>>> memory_subsys_offline
>>> memory_block_action
>>> offline_pages
>>> __offline_pages
>>> percpu_down_write
>>> down_write
>>> lock_acquire mem_hotplug_lock.rw_sem
>>>
>>> When we unbind dax0.0 we have the following stack:
>>> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
>>> drv_attr_store
>>> unbind_store
>>> device_driver_detach
>>> device_release_driver_internal
>>> dev_dax_kmem_remove
>>> remove_memory device_hotplug_lock
>>> try_remove_memory mem_hotplug_lock.rw_sem
>>> arch_remove_memory
>>> __remove_pages
>>> __remove_section
>>> unregister_memory_section
>>> remove_memory_section mem_sysfs_mutex
>>> unregister_memory
>>> device_unregister
>>> device_del
>>> device_remove_attrs
>>> sysfs_remove_groups
>>> sysfs_remove_group
>>> remove_files
>>> kernfs_remove_by_name
>>> kernfs_remove_by_name_ns
>>> __kernfs_remove kn->count#122
>>>
>>> So, lockdep found the ordering issue with the above two stacks:
>>>
>>> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
>>> 2. mem_hotplug_lock.rw_sem -> kn->count#122

I once documented locking behavior in

Documentation/core-api/memory-hotplug.rst

Both device_online() and __remove_memory() always have to be called while
holding device_hotplug_lock, to avoid such races.

# echo offline > /sys/devices/system/memory/memory8/state
and
# echo 0 > /sys/devices/system/memory/memory8/online

either end up in:

-> online_store()
-- lock_device_hotplug_sysfs();
-- device_offline(dev)

-> state_store()
-- lock_device_hotplug_sysfs();
-- device_online(&mem->dev);

So the device_hotplug_lock prohibits the race you describe.

BUT

There is a possible race between the device_hotplug_lock and
kn->count#122. However, that race can never trigger, as userspace
properly backs off in case it cannot get hold of the device_hotplug_lock.

(lock_device_hotplug_sysfs does a mutex_trylock(&device_hotplug_lock))

--

Thanks,

David / dhildenb