Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

From: Zhaoyang Huang
Date: Thu Jun 13 2024 - 23:31:38 EST


On Mon, May 27, 2024 at 4:22 PM Marcin Wanat <private@xxxxxxxxxxxxxx> wrote:
>
> On 22.05.2024 12:13, Marcin Wanat wrote:
> > On 22.05.2024 07:37, Zhaoyang Huang wrote:
> >> On Tue, May 21, 2024 at 11:47 PM Marcin Wanat <private@xxxxxxxxxxxxxx>
> >> wrote:
> >>>
> >>> On 21.05.2024 03:00, Zhaoyang Huang wrote:
> >>>> On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang
> >>>> <huangzhaoyang@xxxxxxxxx> wrote:
> >>>>>
> >>>>> On Tue, May 21, 2024 at 3:42 AM Marcin Wanat
> >>>>> <private@xxxxxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> On 15.04.2024 03:50, Zhaoyang Huang wrote:
> >>>>>> I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
> >>>>>> and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL
> >>>>>> kernel 8/9 is NOT affected, and the long-term kernel 5.15.X is NOT
> >>>>>> affected.
> >>>>>> However, with long-term kernels 6.1.XX and 6.6.XX,
> >>>>>> (tested at least 10 different versions), this lockup always appears
> >>>>>> after 2-30 days, similar to the report in the original thread.
> >>>>>> The more load (for example, copying a lot of local files while
> >>>>>> serving 20Gbps traffic), the higher the chance that the bug will
> >>>>>> appear.
> >>>>>>
> >>>>>> I haven't been able to reproduce this during synthetic tests,
> >>>>>> but it always occurs in production on 6.1.X and 6.6.X within 2-30
> >>>>>> days.
> >>>>>> If anyone can provide a patch, I can test it on multiple machines
> >>>>>> over the next few days.
> >>>>> Could you please try this one, which can be applied to 6.6
> >>>>> directly? Thank you!
> >>>> URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1-zhaoyang.huang@xxxxxxxxxx/
> >>>>
> >>>
> >>> Unfortunately, I am unable to cleanly apply this patch against the
> >>> latest 6.6.31
> >> Please try the one below, which works on my v6.6-based Android.
> >> Thank you in advance for testing :D
> >>
> >> mm/huge_memory.c | 22 ++++++++++++++--------
> >> 1 file changed, 14 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >
> > I have compiled 6.6.31 with this patch and will test it on multiple
> > machines over the next 30 days. I will provide an update after 30 days
> > if everything is fine or sooner if any of the hosts experience the same
> > soft lockup again.
> >
>
> The first server with 6.6.31 and this patch hung today. The soft lockup
> changed to a hard lockup:
>
> [26887.389623] watchdog: Watchdog detected hard LOCKUP on cpu 21
> [26887.389626] Modules linked in: nft_limit xt_limit xt_hashlimit
> ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_connlimit
> nf_conncount tls xt_set ip_set_hash_net ip_set xt_CT xt_conntrack
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables
> nfnetlink rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency
> intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> rapl intel_cstate ipmi_ssif irdma ext4 mbcache ice iTCO_wdt jbd2 mgag200
> intel_pmc_bxt iTCO_vendor_support ib_uverbs i2c_algo_bit acpi_ipmi
> intel_uncore mei_me drm_shmem_helper pcspkr ib_core i2c_i801 ipmi_si
> drm_kms_helper mei lpc_ich i2c_smbus ioatdma intel_pch_thermal
> ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter joydev tcp_bbr
> drm fuse xfs libcrc32c sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul
> crc32c_intel ixgbe polyval_clmulni ahci polyval_generic libahci mdio
> i40e libata megaraid_sas dca ghash_clmulni_intel wmi
> [26887.389682] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G
> W 6.6.31.el9 #3
> [26887.389685] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS
> V5.0.0.12 R1.22.0 for D3384-A1x 06/04/2018
> [26887.389687] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389696] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0
> a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3
> 90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
> [26887.389698] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
> [26887.389700] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX:
> 0000000000000000
> [26887.389701] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff9ad6c6f67050
> [26887.389703] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09:
> 0000000000000067
> [26887.389704] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000046
> [26887.389705] R13: 0000000000000200 R14: 0000000000000000 R15:
> ffffe1138aa98000
> [26887.389707] FS: 0000000000000000(0000) GS:ffff9ade20340000(0000)
> knlGS:0000000000000000
> [26887.389708] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [26887.389710] CR2: 000000002912809b CR3: 000000064401e003 CR4:
> 00000000007706e0
> [26887.389711] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [26887.389712] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [26887.389713] PKRU: 55555554
> [26887.389714] Call Trace:
> [26887.389717] <NMI>
> [26887.389720] ? watchdog_hardlockup_check+0xac/0x150
> [26887.389725] ? __perf_event_overflow+0x102/0x1d0
> [26887.389729] ? handle_pmi_common+0x189/0x3e0
> [26887.389735] ? set_pte_vaddr_p4d+0x4a/0x60
> [26887.389738] ? flush_tlb_one_kernel+0xa/0x20
> [26887.389742] ? native_set_fixmap+0x65/0x80
> [26887.389745] ? ghes_copy_tofrom_phys+0x75/0x110
> [26887.389751] ? __ghes_peek_estatus.isra.0+0x49/0xb0
> [26887.389755] ? intel_pmu_handle_irq+0x10b/0x230
> [26887.389756] ? perf_event_nmi_handler+0x28/0x50
> [26887.389759] ? nmi_handle+0x58/0x150
> [26887.389764] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389768] ? default_do_nmi+0x6b/0x170
> [26887.389770] ? exc_nmi+0x12c/0x1a0
> [26887.389772] ? end_repeat_nmi+0x16/0x1f
> [26887.389777] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389780] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389784] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389787] </NMI>
> [26887.389788] <TASK>
> [26887.389789] __raw_spin_lock_irqsave+0x3d/0x50
> [26887.389793] folio_lruvec_lock_irqsave+0x5e/0x90
> [26887.389798] __page_cache_release+0x68/0x230
> [26887.389801] ? remove_migration_ptes+0x5c/0x80
> [26887.389807] __folio_put+0x24/0x60
> [26887.389808] __split_huge_page+0x368/0x520
> [26887.389812] split_huge_page_to_list+0x4b3/0x570
> [26887.389816] deferred_split_scan+0x1c8/0x290
> [26887.389819] do_shrink_slab+0x12f/0x2d0
> [26887.389824] shrink_slab_memcg+0x133/0x1d0
> [26887.389829] shrink_node_memcgs+0x18e/0x1d0
> [26887.389832] shrink_node+0xa7/0x370
> [26887.389836] balance_pgdat+0x332/0x6f0
> [26887.389842] kswapd+0xf0/0x190
> [26887.389845] ? balance_pgdat+0x6f0/0x6f0
> [26887.389848] kthread+0xee/0x120
> [26887.389851] ? kthread_complete_and_exit+0x20/0x20
> [26887.389853] ret_from_fork+0x2d/0x50
> [26887.389857] ? kthread_complete_and_exit+0x20/0x20
> [26887.389859] ret_from_fork_asm+0x11/0x20
> [26887.389864] </TASK>
> [26887.389865] Kernel panic - not syncing: Hard LOCKUP
> [26887.389867] CPU: 21 PID: 264 Comm: kswapd0 Kdump: loaded Tainted: G
> W 6.6.31.el9 #3
> [26887.389869] Hardware name: FUJITSU PRIMERGY RX2540 M4/D3384-A1, BIOS
> V5.0.0.12 R1.22.0 for D3384-A1x 06/04/2018
> [26887.389870] Call Trace:
> [26887.389871] <NMI>
> [26887.389872] dump_stack_lvl+0x44/0x60
> [26887.389877] panic+0x241/0x330
> [26887.389881] nmi_panic+0x2f/0x40
> [26887.389883] watchdog_hardlockup_check+0x119/0x150
> [26887.389886] __perf_event_overflow+0x102/0x1d0
> [26887.389889] handle_pmi_common+0x189/0x3e0
> [26887.389893] ? set_pte_vaddr_p4d+0x4a/0x60
> [26887.389896] ? flush_tlb_one_kernel+0xa/0x20
> [26887.389899] ? native_set_fixmap+0x65/0x80
> [26887.389902] ? ghes_copy_tofrom_phys+0x75/0x110
> [26887.389906] ? __ghes_peek_estatus.isra.0+0x49/0xb0
> [26887.389909] intel_pmu_handle_irq+0x10b/0x230
> [26887.389911] perf_event_nmi_handler+0x28/0x50
> [26887.389913] nmi_handle+0x58/0x150
> [26887.389916] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389920] default_do_nmi+0x6b/0x170
> [26887.389922] exc_nmi+0x12c/0x1a0
> [26887.389923] end_repeat_nmi+0x16/0x1f
> [26887.389926] RIP: 0010:native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389930] Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0
> a9 00 01 ff ff 0f 85 ea 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3
> 90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
> [26887.389931] RSP: 0018:ffffb3e587a87a20 EFLAGS: 00000002
> [26887.389933] RAX: 0000000000000001 RBX: ffff9ad6c6f67050 RCX:
> 0000000000000000
> [26887.389934] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff9ad6c6f67050
> [26887.389935] RBP: ffff9ad6c6f67050 R08: 0000000000000000 R09:
> 0000000000000067
> [26887.389936] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000000046
> [26887.389937] R13: 0000000000000200 R14: 0000000000000000 R15:
> ffffe1138aa98000
> [26887.389940] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389943] ? native_queued_spin_lock_slowpath+0x6e/0x2c0
> [26887.389946] </NMI>
> [26887.389947] <TASK>
> [26887.389947] __raw_spin_lock_irqsave+0x3d/0x50
> [26887.389950] folio_lruvec_lock_irqsave+0x5e/0x90
> [26887.389953] __page_cache_release+0x68/0x230
> [26887.389955] ? remove_migration_ptes+0x5c/0x80
> [26887.389958] __folio_put+0x24/0x60
> [26887.389960] __split_huge_page+0x368/0x520
> [26887.389963] split_huge_page_to_list+0x4b3/0x570
> [26887.389967] deferred_split_scan+0x1c8/0x290
> [26887.389971] do_shrink_slab+0x12f/0x2d0
> [26887.389974] shrink_slab_memcg+0x133/0x1d0
> [26887.389978] shrink_node_memcgs+0x18e/0x1d0
> [26887.389982] shrink_node+0xa7/0x370
> [26887.389985] balance_pgdat+0x332/0x6f0
> [26887.389991] kswapd+0xf0/0x190
> [26887.389994] ? balance_pgdat+0x6f0/0x6f0
> [26887.389997] kthread+0xee/0x120
> [26887.389998] ? kthread_complete_and_exit+0x20/0x20
> [26887.390000] ret_from_fork+0x2d/0x50
> [26887.390003] ? kthread_complete_and_exit+0x20/0x20
> [26887.390004] ret_from_fork_asm+0x11/0x20
> [26887.390009] </TASK>
>
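Hi Marcin, sorry for the late reply. I think the hard lockup above is
caused by a recursive deadlock, as sketched in [1]: __split_huge_page()
already holds the lruvec lock when folio_put() on a tail folio beyond EOF
reaches __page_cache_release(), which tries to take the same lock again.
This has been fixed by [2], which is in v6.8+. Is your regression test
still running? Thanks very much.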

[1]
static void __split_huge_page(struct page *page, struct list_head *list,
                pgoff_t end, unsigned int new_order)
{
        /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
        lruvec = folio_lruvec_lock(folio);      // takes lruvec->lru_lock here (1st time)

        for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
                __split_huge_page_tail(folio, i, lruvec, list, new_order);
                /* Some pages can be beyond EOF: drop them from page cache */
                if (head[i].index >= end) {
                        folio_put(tail);
                            __page_cache_release
                                folio_lruvec_lock_irqsave       // hangs on the 2nd attempt
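
To make the recursion in [1] easier to see, here is a minimal userspace
sketch. It is only an analogy, not kernel code: an error-checking pthread
mutex stands in for lruvec->lru_lock, and the names lru_lock, split_path()
and release_path() are hypothetical, chosen for illustration. A kernel
spinlock would spin forever at the second acquisition; the error-checking
mutex reports EDEADLK instead, so the problem is observable without
hanging.

```c
#define _XOPEN_SOURCE 700       /* for PTHREAD_MUTEX_ERRORCHECK under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for lruvec->lru_lock (userspace analogy only). */
static pthread_mutex_t lru_lock;

/* Stand-in for __page_cache_release(): it takes the lock on its own. */
static int release_path(void)
{
        int err = pthread_mutex_lock(&lru_lock);        /* 2nd acquisition */

        if (err == 0)
                pthread_mutex_unlock(&lru_lock);
        return err;
}

/*
 * Stand-in for __split_huge_page(): it holds the lock and then calls into
 * release_path() on the same thread. Returns whatever the nested lock
 * attempt returned.
 */
int split_path(void)
{
        pthread_mutexattr_t attr;
        int rc, err;

        rc = pthread_mutexattr_init(&attr);
        assert(rc == 0);
        /* Error-checking type: relocking by the owner fails instead of hanging. */
        rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
        assert(rc == 0);
        rc = pthread_mutex_init(&lru_lock, &attr);
        assert(rc == 0);
        pthread_mutexattr_destroy(&attr);

        rc = pthread_mutex_lock(&lru_lock);             /* 1st acquisition */
        assert(rc == 0);
        err = release_path();                           /* nested attempt */
        pthread_mutex_unlock(&lru_lock);
        pthread_mutex_destroy(&lru_lock);
        return err;
}
```

Calling split_path() from a small main() should return EDEADLK for the
nested attempt. With a plain spinlock there is no such owner check, which
matches the trace above: the CPU sits in
native_queued_spin_lock_slowpath() until the hard lockup watchdog fires.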

[2]
commit f1ee018baee9f4e724e08859c2559323be768be3
Author: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
Date: Tue Feb 27 17:42:42 2024 +0000

mm: use __page_cache_release() in folios_put()

Pass a pointer to the lruvec so we can take advantage of the
folio_lruvec_relock_irqsave(). Adjust the calling convention of
folio_lruvec_relock_irqsave() to suit and add a page_cache_release()
wrapper.