Re: [PATCH 3/4] percpu: implement partial chunk depopulation
From: Dennis Zhou
Date: Fri Jul 02 2021 - 15:45:45 EST
Hello,
On Fri, Jul 02, 2021 at 12:11:40PM -0700, Guenter Roeck wrote:
> Hi,
>
> On Mon, Apr 19, 2021 at 10:50:46PM +0000, Dennis Zhou wrote:
> > From: Roman Gushchin <guro@xxxxxx>
> >
> > This patch implements partial depopulation of percpu chunks.
> >
> > As of now, a chunk can be depopulated only as a part of the final
> > destruction, if there are no more outstanding allocations. However
> > to minimize a memory waste it might be useful to depopulate a
> > partially filed chunk, if a small number of outstanding allocations
> > prevents the chunk from being fully reclaimed.
> >
> > This patch implements the following depopulation process: it scans
> > over the chunk pages, looks for a range of empty and populated pages
> > and performs the depopulation. To avoid races with new allocations,
> > the chunk is previously isolated. After the depopulation the chunk is
> > sidelined to a special list or freed. New allocations prefer using
> > active chunks to sidelined chunks. If a sidelined chunk is used, it is
> > reintegrated to the active lists.
> >
> > The depopulation is scheduled on the free path if the chunk is all of
> > the following:
> > 1) has more than 1/4 of total pages free and populated
> > 2) the system has enough free percpu pages aside of this chunk
> > 3) isn't the reserved chunk
> > 4) isn't the first chunk
> > If it's already depopulated but got free populated pages, it's a good
> > target too. The chunk is moved to a special slot,
> > pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
> > item is scheduled. On isolation, these pages are removed from the
> > pcpu_nr_empty_pop_pages. It is constantly replaced to the
> > to_depopulate_slot when it meets these qualifications.
> >
> > pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
> > becomes empty. The depopulation is performed in the reverse direction to
> > keep populated pages close to the beginning. Depopulated chunks are
> > sidelined to preferentially avoid them for new allocations. When no
> > active chunk can suffice a new allocation, sidelined chunks are first
> > checked before creating a new chunk.
> >
> > Signed-off-by: Roman Gushchin <guro@xxxxxx>
> > Co-developed-by: Dennis Zhou <dennis@xxxxxxxxxx>
> > Signed-off-by: Dennis Zhou <dennis@xxxxxxxxxx>
>
> This patch results in a number of crashes and other odd behavior
> when trying to boot mips images from Megasas controllers in qemu.
> Sometimes the boot stalls, but I also see various crashes.
> Some examples and bisect logs are attached.
Ah, this doesn't look good.. Do you have a reproducer I could use to
debug this?
Thanks,
Dennis
>
> Note: Bisect on mainline ended with
>
> # first bad commit: [e267992f9ef0bf717d70a9ee18049782f77e4b3a] Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/l
> inux/kernel/git/dennis/percpu
>
> I then checked out the merge branch and ran a bisect there, which
> points to this commit. I also rebased the merge branch to v5.13
> and bisected again. Bisect results were the same.
>
> Guenter
>
> ---
> ...
> sd 0:2:0:0: [sda] Add. Sense: Internal target failure
> CPU 0 Unable to handle kernel paging request at virtual address 00000004, epc == 805cf8fc, ra == 802ff3b0
> Oops[#1]:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-00005-g0bd2212ebd7a #1
> $ 0 : 00000000 00000001 00000000 8258fc90
> $ 4 : 825dbd40 820e7624 00000000 fffffff0
> $ 8 : 80c70000 805e1a64 fffffffc 00000000
> $12 : 81006d00 0000001f ffffffe0 00001e83
> $16 : 00000000 825dbd30 80c70000 820e75f8
> $20 : 8275c584 80cc4418 80c9409c 00000008
> $24 : 0000004c 00000000
> $28 : 8204c000 8204fc70 80c26c54 802ff3b0
> Hi : 0000004c
> Lo : 00000000
> epc : 805cf8fc rb_insert_color+0x1c/0x1e0
> ra : 802ff3b0 kernfs_link_sibling+0x94/0x120
> Status: 1000a403 KERNEL EXL IE
> Cause : 00800008 (ExcCode 02)
> BadVA : 00000004
> PrId : 00019300 (MIPS 24Kc)
> Modules linked in:
> Process swapper/0 (pid: 1, threadinfo=(ptrval), task=(ptrval), tls=00000000)
> Stack : 820e75f8 820e75f8 820e75f8 00000000 8275c584 fffffffe 825dbd30 8030084c
> 820e75f8 803003f8 00000000 db668853 00000000 801655f4 00000000 00000000
> 00000001 825dbd30 820e75f8 820e75f8 00000000 80300970 81006c80 82150fc0
> 8204fd64 00000001 00000000 00000001 00000000 00000000 82150fc0 8275c580
> 80c50000 80303dc8 82150fc0 8015ab94 81006c80 8015a960 00000000 8275c580
> ...
> Call Trace:
> [<805cf8fc>] rb_insert_color+0x1c/0x1e0
> [<802ff3b0>] kernfs_link_sibling+0x94/0x120
> [<8030084c>] kernfs_add_one+0xb8/0x184
> [<80300970>] kernfs_create_dir_ns+0x58/0xb0
> [<80303dc8>] sysfs_create_dir_ns+0x74/0x108
> [<805ca51c>] kobject_add_internal+0xb4/0x364
> [<805caaa0>] kobject_init_and_add+0x64/0xa8
> [<8066f768>] bus_add_driver+0x98/0x230
> [<806715a0>] driver_register+0x80/0x144
> [<807c17b8>] usb_register_driver+0xa8/0x1c0
> [<80cb89b8>] uas_init+0x44/0x78
> [<8010065c>] do_one_initcall+0x50/0x1d4
> [<80c95014>] kernel_init_freeable+0x20c/0x29c
> [<80a66bd4>] kernel_init+0x14/0x118
> [<80103098>] ret_from_kernel_thread+0x14/0x1c
>
> Code: 30460001 14c00016 240afffc <8c460004> 34480001 10c30028 00404825 10c00012 00000000
>
> ---[ end trace bb7aba36814796cb ]---
> Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
>
> ---
> scsi host0: Avago SAS based MegaRAID driver
> ata_piix 0000:00:0a.1: enabling device (0000 -> 0001)
> random: fast init done
> scsi 0:2:0:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
> scsi host1: ata_piix
> BUG: spinlock bad magic on CPU#0, kworker/u2:1/41
> lock: 0x82598a50, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
> CPU: 0 PID: 41 Comm: kworker/u2:1 Not tainted 5.13.0-00005-g0bd2212ebd7a #1
> Workqueue: events_unbound async_run_entry_fn
> Stack : 822839e4 80c4eee3 80c50000 80a54338 80c80000 801865e0 00000000 00000004
> 822839e4 5e03e26b 80c50000 8014b4c8 80c50000 00000001 822839e0 8207e4c0
> 00000000 00000000 80b7b9ac 82283828 00000001 8228383c 00000000 0000ffff
> 00000008 00000007 00000280 822b7c00 80c50000 80c70000 00000000 80b80000
> 00000003 00000000 80c50000 00000012 00000000 806591f8 00000000 80cf0000
> ...
> Call Trace:
> [<80109adc>] show_stack+0x84/0x11c
> [<80a62b1c>] dump_stack+0xa8/0xe4
> [<80181468>] do_raw_spin_lock+0xb0/0x128
> [<80a70170>] _raw_spin_lock_irqsave+0x28/0x3c
> [<80176640>] __wake_up_common_lock+0x68/0xe8
> [<801766d4>] __wake_up+0x14/0x20
> [<8054eb48>] percpu_ref_kill_and_confirm+0x120/0x178
> [<80526d2c>] blk_freeze_queue_start+0x58/0x94
> [<8051af0c>] blk_set_queue_dying+0x2c/0x60
> [<8051afb4>] blk_cleanup_queue+0x40/0x130
> [<806975b4>] __scsi_remove_device+0xd4/0x168
> [<80693594>] scsi_probe_and_add_lun+0x53c/0xf44
> [<806944c4>] __scsi_scan_target+0x158/0x754
> [<80694eb4>] scsi_scan_host_selected+0x17c/0x2e0
> [<806950c4>] do_scsi_scan_host+0xac/0xb4
> [<806952f8>] do_scan_async+0x30/0x228
> [<8015510c>] async_run_entry_fn+0x40/0x100
> [<80148384>] process_one_work+0x170/0x428
> [<80148be0>] worker_thread+0x188/0x578
> [<80150d9c>] kthread+0x130/0x160
> [<80103098>] ret_from_kernel_thread+0x14/0x1c
>
> CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc == 801764a0, ra == 80176664
> Oops[#1]:
> CPU: 0 PID: 41 Comm: kworker/u2:1 Not tainted 5.13.0-00005-g0bd2212ebd7a #1
> Workqueue: events_unbound async_run_entry_fn
> $ 0 : 00000000 00000001 00000000 00000000
> $ 4 : 82283b14 00000003 00000000 00000000
> $ 8 : 00000001 822837ac 00000000 0000ffff
> $12 : 00000008 00000007 00000280 822b7c00
> $16 : 82598a50 00000000 82283b08 00000003
> $20 : 00000000 00000000 00000000 fffffff4
> $24 : 00000000 806591f8
> $28 : 82280000 82283ab8 82598a60 80176664
> Hi : 000000a7
> Lo : 3333335d
> epc : 801764a0 __wake_up_common+0x6c/0x1a4
> ra : 80176664 __wake_up_common_lock+0x8c/0xe8
> Status: 1000a402 KERNEL EXL
> Cause : 40808008 (ExcCode 02)
> BadVA : 00000000
> PrId : 00019300 (MIPS 24Kc)
> Modules linked in:
> Process kworker/u2:1 (pid: 41, threadinfo=(ptrval), task=(ptrval), tls=00000000)
> Stack : 82716400 82283c98 8246a880 8052cfe8 82598a50 00000000 00000000 00000000
> 00000003 00000000 80c50000 00000012 80c80000 80176664 00000000 82126c80
> 801762ec 82283afc 00000000 82283b08 00000000 00000000 00000000 82283b14
> 82283b14 5e03e26b 825985c8 80c70000 00000001 00000000 80d10000 00000024
> 00000003 801766d4 00000001 825985c8 80c70000 00000001 00000000 80d10000
> ...
> Call Trace:
> [<801764a0>] __wake_up_common+0x6c/0x1a4
> [<80176664>] __wake_up_common_lock+0x8c/0xe8
> [<801766d4>] __wake_up+0x14/0x20
> [<8054eb48>] percpu_ref_kill_and_confirm+0x120/0x178
> [<80526d2c>] blk_freeze_queue_start+0x58/0x94
> [<8051af0c>] blk_set_queue_dying+0x2c/0x60
> [<8051afb4>] blk_cleanup_queue+0x40/0x130
> [<806975b4>] __scsi_remove_device+0xd4/0x168
> [<80693594>] scsi_probe_and_add_lun+0x53c/0xf44
> [<806944c4>] __scsi_scan_target+0x158/0x754
> [<80694eb4>] scsi_scan_host_selected+0x17c/0x2e0
> [<806950c4>] do_scsi_scan_host+0xac/0xb4
> [<806952f8>] do_scan_async+0x30/0x228
> [<8015510c>] async_run_entry_fn+0x40/0x100
> [<80148384>] process_one_work+0x170/0x428
> [<80148be0>] worker_thread+0x188/0x578
> [<80150d9c>] kthread+0x130/0x160
> [<80103098>] ret_from_kernel_thread+0x14/0x1c
>
> ---
>
> megaraid_sas 0000:00:14.0: Max firmware commands: 1007 shared with default hw_queues = 1 poll_queues 0
> scsi host0: Avago SAS based MegaRAID driver
> ata_piix 0000:00:0a.1: enabling device (0000 -> 0001)
> scsi 0:2:0:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
> scsi host1: ata_piix
> scsi host2: ata_piix
> CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc == 00000000, ra == 8019d0b4
> Oops[#1]:
> CPU: 0 PID: 40 Comm: kworker/u2:1 Not tainted 5.13.0-07637-g3dbdb38e2869 #1
> Workqueue: events_unbound async_run_entry_fn
> $ 0 : 00000000 00000001 82568620 00000000
> $ 4 : 82568620 00000200 8019d0b4 8212d580
> $ 8 : ffffffe0 000003fc 00000000 81006d70
> $12 : 81006d40 0000020c 00000000 80ab4400
> $16 : 81007480 00000008 8201ff00 0000000a
> $20 : 00000000 810074bc 80ced800 80cd0000
> $24 : 000b0f1b 00000739
> $28 : 82298000 8201fee8 8019d2f8 8019d0b4
> Hi : 00003f05
> Lo : 0000000f
> epc : 00000000 0x0
> ra : 8019d0b4 rcu_core+0x260/0x754
> Status: 1000a403 KERNEL EXL IE
> Cause : 00800008 (ExcCode 02)
> BadVA : 00000000
> PrId : 00019300 (MIPS 24Kc)
> Modules linked in:
> Process kworker/u2:1 (pid: 40, threadinfo=(ptrval), task=(ptrval), tls=00000000)
> Stack : 00000000 8018b544 ffffffc8 ffffffc8 80bde598 00000000 00000000 8201ff00
> 00000048 2942dcc1 80ccc2c8 80cb8080 80d68358 0000000a 00000024 00000009
> 00000100 80cb80a4 00000000 80aaac38 80cd0000 80cba400 80cba400 80191214
> 00014680 2942dcc1 80cf9980 80ab3ce0 80bd9020 80d682f4 80d6e880 80d6e880
> ffff8fcf 80cd0000 80ab0000 04208060 80ccc2c8 00000001 00000020 80da0000
> ...
> Call Trace:
>
> [<8018b544>] __handle_irq_event_percpu+0xbc/0x184
> [<80aaac38>] __do_softirq+0x190/0x33c
> [<80191214>] handle_level_irq+0x130/0x1e8
> [<80132fb8>] irq_exit+0x130/0x138
> [<806112d0>] plat_irq_dispatch+0x9c/0x118
> [<80103404>] handle_int+0x144/0x150
>
> Code: (Bad address in epc)
>
>
> ---[ end trace 5d4c5bf55a0bb13f ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> ------------[ cut here ]------------
>
> ---
> Bisect on mainline:
>
> # bad: [3dbdb38e286903ec220aaf1fb29a8d94297da246] Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
> # good: [007b350a58754a93ca9fe50c498cc27780171153] Merge tag 'dlm-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
> git bisect start '3dbdb38e2869' '007b350a5875'
> # good: [b6df00789e2831fff7a2c65aa7164b2a4dcbe599] Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> git bisect good b6df00789e2831fff7a2c65aa7164b2a4dcbe599
> # good: [990ec3014deedfed49e610cdc31dc6930ca63d8d] drm/amdgpu: add psp runtime db structures
> git bisect good 990ec3014deedfed49e610cdc31dc6930ca63d8d
> # good: [c288d9cd710433e5991d58a0764c4d08a933b871] Merge tag 'for-5.14/io_uring-2021-06-30' of git://git.kernel.dk/linux-block
> git bisect good c288d9cd710433e5991d58a0764c4d08a933b871
> # good: [514798d36572fb8eba6ccff3de10c9615063a7f5] Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
> git bisect good 514798d36572fb8eba6ccff3de10c9615063a7f5
> # good: [630e438f040c3838206b5e6717b9b5c29edf3548] RDMA/rtrs: Introduce head/tail wr
> git bisect good 630e438f040c3838206b5e6717b9b5c29edf3548
> # good: [a32b344e6f4375c5bdc3e89d0997b7eae187a3b1] Merge tag 'pinctrl-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
> git bisect good a32b344e6f4375c5bdc3e89d0997b7eae187a3b1
> # good: [cad065ed8d8831df67b9754cc4437ed55d8b48c0] MIPS: MT extensions are not available on MIPS32r1
> git bisect good cad065ed8d8831df67b9754cc4437ed55d8b48c0
> # good: [e4d777003a43feab2e000749163e531f6c48c385] percpu: optimize locking in pcpu_balance_workfn()
> git bisect good e4d777003a43feab2e000749163e531f6c48c385
> # bad: [e267992f9ef0bf717d70a9ee18049782f77e4b3a] Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
> git bisect bad e267992f9ef0bf717d70a9ee18049782f77e4b3a
> # good: [ab3040e1379bd6fcc260f1f7558ee9c2da62766b] MIPS: Ingenic: Add MAC syscon nodes for Ingenic SoCs.
> git bisect good ab3040e1379bd6fcc260f1f7558ee9c2da62766b
> # good: [34c522a07ccbfb0e6476713b41a09f9f51a06c9f] MIPS: CI20: Add second percpu timer for SMP.
> git bisect good 34c522a07ccbfb0e6476713b41a09f9f51a06c9f
> # good: [19b438592238b3b40c3f945bb5f9c4ca971c0c45] Merge tag 'mips_5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
> git bisect good 19b438592238b3b40c3f945bb5f9c4ca971c0c45
> # first bad commit: [e267992f9ef0bf717d70a9ee18049782f77e4b3a] Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
>
> ---
> Bisect on merge branch:
>
> # bad: [e4d777003a43feab2e000749163e531f6c48c385] percpu: optimize locking in pcpu_balance_workfn()
> # good: [d434405aaab7d0ebc516b68a8fc4100922d7f5ef] Linux 5.12-rc7
> git bisect start 'HEAD' 'v5.12-rc7'
> # bad: [f183324133ea535db4127f9fad3e19725ca88bf3] percpu: implement partial chunk depopulation
> git bisect bad f183324133ea535db4127f9fad3e19725ca88bf3
> # good: [67c2669d69fb5ada0f3b5123fb6ebf6fef9faee5] percpu: split __pcpu_balance_workfn()
> git bisect good 67c2669d69fb5ada0f3b5123fb6ebf6fef9faee5
> # good: [1c29a3ceaf5f02919e0a89119a70382581453dbb] percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
> git bisect good 1c29a3ceaf5f02919e0a89119a70382581453dbb
> # first bad commit: [f183324133ea535db4127f9fad3e19725ca88bf3] percpu: implement partial chunk depopulation
>
> ---
> Bisect on rebased merge branch:
>
> # bad: [737dc4074d4969ee54d7f781591bcc608fc6990f] percpu: optimize locking in pcpu_balance_workfn()
> # good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
> git bisect start 'HEAD' 'v5.13'
> # bad: [0bd2212ebd7a02a6c0e870bb4b35abc321c203bc] percpu: implement partial chunk depopulation
> git bisect bad 0bd2212ebd7a02a6c0e870bb4b35abc321c203bc
> # good: [a7aebdb482a3aa87a61f6414a87f31eb657c41f6] percpu: split __pcpu_balance_workfn()
> git bisect good a7aebdb482a3aa87a61f6414a87f31eb657c41f6
> # good: [123a0c4318bb8cfb984f41c0499064c383dd9eee] percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
> git bisect good 123a0c4318bb8cfb984f41c0499064c383dd9eee
> # first bad commit: [0bd2212ebd7a02a6c0e870bb4b35abc321c203bc] percpu: implement partial chunk depopulation