Re: [v5 0/3] "Hotremove" persistent memory
From: Pavel Tatashin
Date: Fri May 17 2019 - 10:12:19 EST
>
> I would think that ACPI hotplug would have a similar problem, but it does this:
>
> acpi_unbind_memory_blocks(info);
> __remove_memory(nid, info->start_addr, info->length);
ACPI does have exactly the same problem, so this is not a bug for this
series, I will submit a new version of my series with comments
addressed, but without fix for this issue.
I was able to reproduce this issue on the current mainline kernel.
Also, I been thinking more about how to fix it, and there is no easy
fix without a major hotplug redesign. Basically, we have to remove
sysfs memory entries before or after memory is hotplugged/hotremoved.
But, we also have to guarantee that hotplug/hotremove will succeed or
reinstate sysfs entries.
Qemu script:
qemu-system-x86_64 \
-enable-kvm \
-cpu host \
-parallel none \
-echr 1 \
-serial none \
-chardev stdio,id=console,signal=off,mux=on \
-serial chardev:console \
-mon chardev=console \
-vga none \
-display none \
-kernel pmem/native/arch/x86/boot/bzImage \
-m 8G,slots=1,maxmem=16G \
-smp 8 \
-fsdev local,id=virtfs1,path=/,security_model=none \
-device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs \
-append 'earlyprintk=serial,ttyS0,115200 console=ttyS0
TERM=xterm ip=dhcp loglevel=7'
Config is attached.
Steps to reproduce:
#
# QEMU 4.0.0 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
(qemu)
# echo online_movable > /sys/devices/system/memory/memory79/state
[ 23.029552] Built 1 zonelists, mobility grouping on. Total pages: 2045370
[ 23.032591] Policy zone: Normal
# (qemu) device_del dimm1
(qemu) [ 32.013950] Offlined Pages 32768
[ 32.014307] Built 1 zonelists, mobility grouping on. Total pages: 2031022
[ 32.014843] Policy zone: Normal
[ 32.015733]
[ 32.015881] ======================================================
[ 32.016390] WARNING: possible circular locking dependency detected
[ 32.016881] 5.1.0_pt_pmem #38 Not tainted
[ 32.017202] ------------------------------------------------------
[ 32.017680] kworker/u16:4/380 is trying to acquire lock:
[ 32.018096] 00000000675cc7e1 (kn->count#18){++++}, at:
kernfs_remove_by_name_ns+0x3b/0x80
[ 32.018745]
[ 32.018745] but task is already holding lock:
[ 32.019201] 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x1d/0xa0
[ 32.019859]
[ 32.019859] which lock already depends on the new lock.
[ 32.019859]
[ 32.020499]
[ 32.020499] the existing dependency chain (in reverse order) is:
[ 32.021080]
[ 32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
[ 32.021522] __mutex_lock+0x8b/0x900
[ 32.021843] hotplug_memory_register+0x26/0xa0
[ 32.022231] __add_pages+0xe7/0x160
[ 32.022545] add_pages+0xd/0x60
[ 32.022835] add_memory_resource+0xc3/0x1d0
[ 32.023207] __add_memory+0x57/0x80
[ 32.023530] acpi_memory_device_add+0x13a/0x2d0
[ 32.023928] acpi_bus_attach+0xf1/0x200
[ 32.024272] acpi_bus_scan+0x3e/0x90
[ 32.024597] acpi_device_hotplug+0x284/0x3e0
[ 32.024972] acpi_hotplug_work_fn+0x15/0x20
[ 32.025342] process_one_work+0x2a0/0x650
[ 32.025755] worker_thread+0x34/0x3d0
[ 32.026077] kthread+0x118/0x130
[ 32.026442] ret_from_fork+0x3a/0x50
[ 32.026766]
[ 32.026766] -> #3 (mem_hotplug_lock.rw_sem){++++}:
[ 32.027261] get_online_mems+0x39/0x80
[ 32.027600] kmem_cache_create_usercopy+0x29/0x2c0
[ 32.028019] kmem_cache_create+0xd/0x10
[ 32.028367] ptlock_cache_init+0x1b/0x23
[ 32.028724] start_kernel+0x1d2/0x4b8
[ 32.029060] secondary_startup_64+0xa4/0xb0
[ 32.029447]
[ 32.029447] -> #2 (cpu_hotplug_lock.rw_sem){++++}:
[ 32.030007] cpus_read_lock+0x39/0x80
[ 32.030360] __offline_pages+0x32/0x790
[ 32.030709] memory_subsys_offline+0x3a/0x60
[ 32.031089] device_offline+0x7e/0xb0
[ 32.031425] acpi_bus_offline+0xd8/0x140
[ 32.031821] acpi_device_hotplug+0x1b2/0x3e0
[ 32.032202] acpi_hotplug_work_fn+0x15/0x20
[ 32.032576] process_one_work+0x2a0/0x650
[ 32.032942] worker_thread+0x34/0x3d0
[ 32.033283] kthread+0x118/0x130
[ 32.033588] ret_from_fork+0x3a/0x50
[ 32.033919]
[ 32.033919] -> #1 (&device->physical_node_lock){+.+.}:
[ 32.034450] __mutex_lock+0x8b/0x900
[ 32.034784] acpi_get_first_physical_node+0x16/0x60
[ 32.035217] acpi_companion_match+0x3b/0x60
[ 32.035594] acpi_device_uevent_modalias+0x9/0x20
[ 32.036012] platform_uevent+0xd/0x40
[ 32.036352] dev_uevent+0x85/0x1c0
[ 32.036674] kobject_uevent_env+0x1e2/0x640
[ 32.037044] kobject_synth_uevent+0x2b7/0x324
[ 32.037428] uevent_store+0x17/0x30
[ 32.037752] kernfs_fop_write+0xeb/0x1a0
[ 32.038112] vfs_write+0xb2/0x1b0
[ 32.038417] ksys_write+0x57/0xd0
[ 32.038721] do_syscall_64+0x4b/0x1a0
[ 32.039053] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 32.039491]
[ 32.039491] -> #0 (kn->count#18){++++}:
[ 32.039913] lock_acquire+0xaa/0x180
[ 32.040242] __kernfs_remove+0x244/0x2d0
[ 32.040593] kernfs_remove_by_name_ns+0x3b/0x80
[ 32.040991] device_del+0x14a/0x370
[ 32.041309] device_unregister+0x9/0x20
[ 32.041653] unregister_memory_section+0x69/0xa0
[ 32.042059] __remove_pages+0x112/0x460
[ 32.042402] arch_remove_memory+0x6f/0xa0
[ 32.042758] __remove_memory+0xab/0x130
[ 32.043103] acpi_memory_device_remove+0x67/0xe0
[ 32.043537] acpi_bus_trim+0x50/0x90
[ 32.043889] acpi_device_hotplug+0x2fa/0x3e0
[ 32.044300] acpi_hotplug_work_fn+0x15/0x20
[ 32.044686] process_one_work+0x2a0/0x650
[ 32.045044] worker_thread+0x34/0x3d0
[ 32.045381] kthread+0x118/0x130
[ 32.045679] ret_from_fork+0x3a/0x50
[ 32.046005]
[ 32.046005] other info that might help us debug this:
[ 32.046005]
[ 32.046636] Chain exists of:
[ 32.046636] kn->count#18 --> mem_hotplug_lock.rw_sem --> mem_sysfs_mutex
[ 32.046636]
[ 32.047514] Possible unsafe locking scenario:
[ 32.047514]
[ 32.047976] CPU0 CPU1
[ 32.048337] ---- ----
[ 32.048697] lock(mem_sysfs_mutex);
[ 32.048983] lock(mem_hotplug_lock.rw_sem);
[ 32.049519] lock(mem_sysfs_mutex);
[ 32.050004] lock(kn->count#18);
[ 32.050270]
[ 32.050270] *** DEADLOCK ***
[ 32.050270]
[ 32.050736] 7 locks held by kworker/u16:4/380:
[ 32.051087] #0: 00000000a22fe78e
((wq_completion)kacpi_hotplug){+.+.}, at: process_one_work+0x21e/0x650
[ 32.051830] #1: 00000000944f2dca
((work_completion)(&hpw->work)){+.+.}, at:
process_one_work+0x21e/0x650
[ 32.052577] #2: 0000000024bbe147 (device_hotplug_lock){+.+.}, at:
acpi_device_hotplug+0x2e/0x3e0
[ 32.053271] #3: 000000005cb50027 (acpi_scan_lock){+.+.}, at:
acpi_device_hotplug+0x3c/0x3e0
[ 32.053916] #4: 00000000b8d06992 (cpu_hotplug_lock.rw_sem){++++},
at: __remove_memory+0x3b/0x130
[ 32.054602] #5: 00000000897f0ef4 (mem_hotplug_lock.rw_sem){++++},
at: percpu_down_write+0x1d/0x110
[ 32.055315] #6: 0000000053e50a99 (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x1d/0xa0
[ 32.056016]
[ 32.056016] stack backtrace:
[ 32.056355] CPU: 4 PID: 380 Comm: kworker/u16:4 Not tainted 5.1.0_pt_pmem #38
[ 32.056923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[ 32.057720] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ 32.058144] Call Trace:
[ 32.058344] dump_stack+0x67/0x90
[ 32.058604] print_circular_bug.cold.60+0x15c/0x195
[ 32.058989] __lock_acquire+0x17de/0x1d30
[ 32.059308] ? find_held_lock+0x2d/0x90
[ 32.059611] ? __kernfs_remove+0x199/0x2d0
[ 32.059937] lock_acquire+0xaa/0x180
[ 32.060223] ? kernfs_remove_by_name_ns+0x3b/0x80
[ 32.060596] __kernfs_remove+0x244/0x2d0
[ 32.060908] ? kernfs_remove_by_name_ns+0x3b/0x80
[ 32.061283] ? kernfs_name_hash+0xd/0x80
[ 32.061596] ? kernfs_find_ns+0x68/0xf0
[ 32.061907] kernfs_remove_by_name_ns+0x3b/0x80
[ 32.062266] device_del+0x14a/0x370
[ 32.062548] ? unregister_mem_sect_under_nodes+0x4f/0xc0
[ 32.062973] device_unregister+0x9/0x20
[ 32.063285] unregister_memory_section+0x69/0xa0
[ 32.063651] __remove_pages+0x112/0x460
[ 32.063949] arch_remove_memory+0x6f/0xa0
[ 32.064271] __remove_memory+0xab/0x130
[ 32.064579] ? walk_memory_range+0xa1/0xe0
[ 32.064907] acpi_memory_device_remove+0x67/0xe0
[ 32.065274] acpi_bus_trim+0x50/0x90
[ 32.065560] acpi_device_hotplug+0x2fa/0x3e0
[ 32.065900] acpi_hotplug_work_fn+0x15/0x20
[ 32.066249] process_one_work+0x2a0/0x650
[ 32.066591] worker_thread+0x34/0x3d0
[ 32.066925] ? process_one_work+0x650/0x650
[ 32.067275] kthread+0x118/0x130
[ 32.067542] ? kthread_create_on_node+0x60/0x60
[ 32.067909] ret_from_fork+0x3a/0x50
>
> I wonder if that ordering prevents going too deep into the
> device_unregister() call stack that you highlighted below.
>
>
> >
> > Here is the problem:
> >
> > When we offline pages we have the following call stack:
> >
> > # echo offline > /sys/devices/system/memory/memory8/state
> > ksys_write
> > vfs_write
> > __vfs_write
> > kernfs_fop_write
> > kernfs_get_active
> > lock_acquire kn->count#122 (lock for
> > "memory8/state" kn)
> > sysfs_kf_write
> > dev_attr_store
> > state_store
> > device_offline
> > memory_subsys_offline
> > memory_block_action
> > offline_pages
> > __offline_pages
> > percpu_down_write
> > down_write
> > lock_acquire mem_hotplug_lock.rw_sem
> >
> > When we unbind dax0.0 we have the following stack:
> > # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> > drv_attr_store
> > unbind_store
> > device_driver_detach
> > device_release_driver_internal
> > dev_dax_kmem_remove
> > remove_memory device_hotplug_lock
> > try_remove_memory mem_hotplug_lock.rw_sem
> > arch_remove_memory
> > __remove_pages
> > __remove_section
> > unregister_memory_section
> > remove_memory_section mem_sysfs_mutex
> > unregister_memory
> > device_unregister
> > device_del
> > device_remove_attrs
> > sysfs_remove_groups
> > sysfs_remove_group
> > remove_files
> > kernfs_remove_by_name
> > kernfs_remove_by_name_ns
> > __kernfs_remove kn->count#122
> >
> > So, lockdep found the ordering issue with the above two stacks:
> >
> > 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> > 2. mem_hotplug_lock.rw_sem -> kn->count#122
Attachment:
x86.config.bz2
Description: application/bzip