Re: [syzbot] [mm?] kernel BUG in try_to_unmap_one

From: Zi Yan
Date: Mon Mar 03 2025 - 12:17:47 EST


On 3 Mar 2025, at 11:46, David Hildenbrand wrote:

> On 02.03.25 00:40, Hillf Danton wrote:
>> On Sat, 01 Mar 2025 14:41:20 -0800
>>> Hello,
>>>
>>> syzbot found the following issue on:
>>>
>>> HEAD commit: e5d3fd687aac Add linux-next specific files for 20250218
>>> git tree: linux-next
>>> console output: https://syzkaller.appspot.com/x/log.txt?x=12faf7f8580000
>>> kernel config: https://syzkaller.appspot.com/x/.config?x=4e945b2fe8e5992f
>>> dashboard link: https://syzkaller.appspot.com/bug?extid=fb86166504f57eff29d7
>>> compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
>>>
>>> Unfortunately, I don't have any reproducer for this issue yet.
>>>
>>> Downloadable assets:
>>> disk image: https://storage.googleapis.com/syzbot-assets/ef079ccd2725/disk-e5d3fd68.raw.xz
>>> vmlinux: https://storage.googleapis.com/syzbot-assets/99f2123d6831/vmlinux-e5d3fd68.xz
>>> kernel image: https://storage.googleapis.com/syzbot-assets/eadfc9520358/bzImage-e5d3fd68.xz
>>>
>>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>>> Reported-by: syzbot+fb86166504f57eff29d7@xxxxxxxxxxxxxxxxxxxxxxxxx
>>>
>>> evict+0x4e8/0x9a0 fs/inode.c:806
>>> __dentry_kill+0x20d/0x630 fs/dcache.c:660
>>> dput+0x19f/0x2b0 fs/dcache.c:902
>>> __fput+0x60b/0x9f0 fs/file_table.c:472
>>> task_work_run+0x24f/0x310 kernel/task_work.c:227
>>> resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
>>> exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
>>> exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
>>> __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
>>> syscall_exit_to_user_mode+0x13f/0x340 kernel/entry/common.c:218
>>> do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89
>>> entry_SYSCALL_64_after_hwframe+0x77/0x7f
>>> ------------[ cut here ]------------
>>> kernel BUG at mm/rmap.c:1858!
>>> Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
>>> CPU: 1 UID: 0 PID: 6053 Comm: syz.4.27 Not tainted 6.14.0-rc3-next-20250218-syzkaller #0
>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
>>> RIP: 0010:try_to_unmap_one+0x3d0d/0x3fa0 mm/rmap.c:1858
>>> Code: c7 c7 80 93 c3 8e 48 89 da e8 ef f3 19 03 e9 68 ca ff ff e8 b5 12 ab ff 48 8b 7c 24 20 48 c7 c6 80 17 36 8c e8 94 d2 f5 ff 90 <0f> 0b e8 9c 12 ab ff 48 8b 7c 24 18 48 c7 c6 40 1c 36 8c e8 7b d2
>>> RSP: 0018:ffffc9000b1be9c0 EFLAGS: 00010246
>>> RAX: 367eb4645686ad00 RBX: 00000000f4000000 RCX: ffffc9000b1be503
>>> RDX: 0000000000000004 RSI: ffffffff8c2aaf60 RDI: ffffffff8c8156e0
>>> RBP: ffffc9000b1bedf0 R08: ffffffff903da477 R09: 1ffffffff207b48e
>>> R10: dffffc0000000000 R11: fffffbfff207b48f R12: 8000000053c008e7
>>> R13: dffffc0000000000 R14: ffffea00014f0000 R15: ffffea00014f0030
>>> FS: 00007f4d2783e6c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 000000110c465fa1 CR3: 000000002a1f6000 CR4: 00000000003526f0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> Call Trace:
>>> <TASK>
>>> __rmap_walk_file+0x420/0x5f0 mm/rmap.c:2774
>>> try_to_unmap+0x219/0x2e0
>>> unmap_folio+0x183/0x1f0 mm/huge_memory.c:3053
>>> __folio_split+0x849/0x16d0 mm/huge_memory.c:3696
>>> truncate_inode_partial_folio+0x9b1/0xdc0 mm/truncate.c:234
>>> shmem_undo_range+0x82f/0x1820 mm/shmem.c:1143
>>
>> Given folio_test_hugetlb(folio) [1], what is weird is a hugetlb page in a
>> shmem mapping.
>>
>
> Right, the problem begins when we call __folio_split() on a hugetlb folio, and the issue is that we seem to find that in the pagecache.
>
> I wonder if there is some weird interaction with our recent folio split changes in next. Maybe, for some reason, we end up adding a wrong folio to the pagecache during a split (truncation), and a follow-up split (truncation) finds the wrong folio.
>
> Just a guess, though. CCing Zi Yan.

You are right. I have a fix:
https://lore.kernel.org/linux-mm/56EBE3B6-99EA-470E-B2B3-92C9C13032DF@xxxxxxxxxx/

I should have verified folio2 after it is locked and before the second split.
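
For illustration only, the general "lock, then re-verify" pattern can be
sketched in userspace C as below. This is not the actual patch (see the lore
link above for that). struct model_folio, struct model_slot, model_trylock(),
model_unlock() and lock_and_verify() are all hypothetical names made up for
this sketch and are not kernel APIs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio; NOT the kernel's struct folio. */
struct model_folio {
	const void *mapping;	/* owning mapping, may change while unlocked */
	bool locked;
};

/* Hypothetical stand-in for the page cache slot the folio was looked up in. */
struct model_slot {
	struct model_folio *folio;
};

/* Take the model lock; fail if it is already held. */
static bool model_trylock(struct model_folio *f)
{
	if (f->locked)
		return false;
	f->locked = true;
	return true;
}

static void model_unlock(struct model_folio *f)
{
	f->locked = false;
}

/*
 * Lock the folio, then re-check that it is still the folio we looked up:
 * same mapping, and still installed in the slot. The lookup happened
 * without the lock, so either may have changed in the meantime.
 * Returns true with the folio locked, or false with it unlocked.
 */
static bool lock_and_verify(struct model_slot *slot, struct model_folio *f,
			    const void *mapping)
{
	if (!model_trylock(f))
		return false;
	if (f->mapping != mapping || slot->folio != f) {
		model_unlock(f);	/* stale folio: caller must skip it */
		return false;
	}
	return true;
}
```

A caller in this model would attempt the second split only when
lock_and_verify() returns true, and skip the folio otherwise.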

Best Regards,
Yan, Zi