Re: [PATCH v8 15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure

From: Lorenzo Stoakes
Date: Tue Sep 03 2024 - 07:00:57 EST

Next message: Barry Song: "Re: [PATCH] binder_alloc: Move alloc_page() out of mmap_rwsem to reduce the lock duration"
Previous message: claudiu beznea: "Re: [PATCH 00/16] Add initial USB support for the Renesas RZ/G3S SoC"
In reply to: Pengfei Xu: "Re: [PATCH v8 15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure"
Next in thread: Lorenzo Stoakes: "Re: [PATCH v8 15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Sep 03, 2024 at 11:07:38AM GMT, Pengfei Xu wrote:
> Hi Liam R. Howlett,
>
> Greetings!
>
> There is WARNING in __split_vma in next-20240902 in syzkaller fuzzing test.
> Bisected and found first bad commit:
> "
> 3483c95414f9 mm: change failure of MAP_FIXED to restoring the gap on failure
> "
> It's same as below patch.
> After reverted the above commit on top of next-20240902, this issue was gone.
>
> All detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/240903_092137___split_vma
> Syzkaller repro code: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.c
> Syzkaller repro syscall steps: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.prog
> Syzkaller report: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/repro.report
> Kconfig(make olddefconfig): https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/kconfig_origin
> Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/bisect_info.log
> bzImage: https://github.com/xupengfe/syzkaller_logs/raw/main/240903_092137___split_vma/bzImage_ecc768a84f0b8e631986f9ade3118fa37852fef0.tar.gz
> Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/240903_092137___split_vma/ecc768a84f0b8e631986f9ade3118fa37852fef0_dmesg.log
>
> And "KASAN: slab-use-after-free Read in acct_collect" also pointed to the
> same commit, all detailed info:
> https://github.com/xupengfe/syzkaller_logs/tree/main/240903_090000_acct_collect
>
> "

Thanks for the report! Looking into it.

> [ 19.953726] cgroup: Unknown subsys name 'net'
> [ 20.045121] cgroup: Unknown subsys name 'rlimit'
> [ 20.138332] ------------[ cut here ]------------
> [ 20.138634] WARNING: CPU: 1 PID: 732 at include/linux/maple_tree.h:733 __split_vma+0x4d7/0x1020
> [ 20.139075] Modules linked in:
> [ 20.139245] CPU: 1 UID: 0 PID: 732 Comm: repro Not tainted 6.11.0-rc6-next-20240902-ecc768a84f0b #1
> [ 20.139779] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [ 20.140337] RIP: 0010:__split_vma+0x4d7/0x1020
> [ 20.140572] Code: 89 ee 48 8b 40 10 48 89 c7 48 89 85 00 ff ff ff e8 8e 61 a7 ff 48 8b 85 00 ff ff ff 4c 39 e8 0f 83 ea fd ff ff e8 b9 5e a7 ff <0f> 0b e9 de fd ff ff 48 8b 85 30 ff ff ff 48 83 c0 10 48 89 85 18
> [ 20.141476] RSP: 0018:ffff8880217379a0 EFLAGS: 00010293
> [ 20.141749] RAX: 0000000000000000 RBX: ffff8880132351e0 RCX: ffffffff81bf6117
> [ 20.142106] RDX: ffff888012c30000 RSI: ffffffff81bf6187 RDI: 0000000000000006
> [ 20.142457] RBP: ffff888021737aa0 R08: 0000000000000001 R09: ffffed100263d3cd
> [ 20.142814] R10: 0000000020ff9000 R11: 0000000000000001 R12: ffff888021737e40
> [ 20.143173] R13: 0000000020ff9000 R14: 0000000020ffc000 R15: ffff888013235d20
> [ 20.143529] FS: 00007eff937f9740(0000) GS:ffff88806c500000(0000) knlGS:0000000000000000
> [ 20.144308] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 20.144600] CR2: 0000000020000040 CR3: 000000001f464003 CR4: 0000000000770ef0
> [ 20.144958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 20.145313] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
> [ 20.145665] PKRU: 55555554
> [ 20.145809] Call Trace:
> [ 20.145940] <TASK>
> [ 20.146056] ? show_regs+0x6d/0x80
> [ 20.146247] ? __warn+0xf3/0x380
> [ 20.146431] ? report_bug+0x25e/0x4b0
> [ 20.146650] ? __split_vma+0x4d7/0x1020

Have repro'd locally. This is, unsurprisingly, on this line (even if trace above
doesn't decode to it unfortunately):

vma_iter_config(vmi, new->vm_start, new->vm_end);

The VMA in question spans 0x20ff9000, 0x21000000 so is 7 pages in size.

At the point of invoking vma_iter_config(), the vma iterator points at
0x20ff9001, but we try to position it to 0x20ff9000.

It seems the issue is that in do_vmi_munmap(), after vma_find() is called, we
find a VMA at 0x20ff9000, but the VMI is positioned to 0x20ff9001...!

Perhaps maple tree corruption in a previous call somehow?

I can interestingly only repro this if I clear the qemu image each time, I'm
guessing this is somehow tied to the instantiation of the cgroup setup or such?

Am continuing the investigation.

> [ 20.146864] ? report_bug+0x2cb/0x4b0
> [ 20.147070] ? __split_vma+0x4d7/0x1020
> [ 20.147281] ? __split_vma+0x4d8/0x1020
> [ 20.147492] ? handle_bug+0xf1/0x190
> [ 20.147756] ? exc_invalid_op+0x3c/0x80
> [ 20.147971] ? asm_exc_invalid_op+0x1f/0x30
> [ 20.148208] ? __split_vma+0x467/0x1020
> [ 20.148417] ? __split_vma+0x4d7/0x1020
> [ 20.148628] ? __split_vma+0x4d7/0x1020
> [ 20.148845] ? __pfx___split_vma+0x10/0x10
> [ 20.149068] ? __kasan_check_read+0x15/0x20
> [ 20.149300] ? mark_lock.part.0+0xf3/0x17b0
> [ 20.149535] ? __kasan_check_read+0x15/0x20
> [ 20.149769] vms_gather_munmap_vmas+0x178/0xf70
> [ 20.150024] do_vmi_align_munmap+0x26e/0x640
> [ 20.150257] ? __pfx___lock_acquire+0x10/0x10
> [ 20.150500] ? __pfx_do_vmi_align_munmap+0x10/0x10
> [ 20.150758] ? __this_cpu_preempt_check+0x21/0x30
> [ 20.151018] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [ 20.151308] ? mtree_range_walk+0x728/0xb70
> [ 20.151602] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
> [ 20.151891] ? mas_walk+0x6a7/0x8b0
> [ 20.152096] do_vmi_munmap+0x202/0x400
> [ 20.152307] __vm_munmap+0x182/0x380
> [ 20.152509] ? __pfx___vm_munmap+0x10/0x10
> [ 20.152729] ? seqcount_lockdep_reader_access.constprop.0+0xb4/0xd0
> [ 20.153061] ? lockdep_hardirqs_on+0x89/0x110
> [ 20.153299] ? trace_hardirqs_on+0x51/0x60
> [ 20.153533] ? __audit_syscall_entry+0x39c/0x500
> [ 20.153790] __x64_sys_munmap+0x62/0x90
> [ 20.154001] x64_sys_call+0x198f/0x2140
> [ 20.154212] do_syscall_64+0x6d/0x140
> [ 20.154414] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 20.154683] RIP: 0033:0x7eff9343ee5d
> [ 20.154885] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
> [ 20.155872] RSP: 002b:00007ffe28711628 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
> [ 20.156267] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff9343ee5d
> [ 20.156635] RDX: ffffffffffffff80 RSI: 0000000000001000 RDI: 0000000020ffc000
> [ 20.157002] RBP: 00007ffe28711640 R08: 0000000000000000 R09: 00007ffe28711640
> [ 20.157369] R10: 0000000000000003 R11: 0000000000000206 R12: 00007ffe287117d8
> [ 20.157737] R13: 00000000004027cc R14: 0000000000404e08 R15: 00007eff93844000
> [ 20.158122] </TASK>
> [ 20.158245] irq event stamp: 2019
> [ 20.158426] hardirqs last enabled at (2027): [<ffffffff814629e4>] console_unlock+0x224/0x240
> [ 20.158869] hardirqs last disabled at (2034): [<ffffffff814629c9>] console_unlock+0x209/0x240
> [ 20.159306] softirqs last enabled at (1954): [<ffffffff81289969>] __irq_exit_rcu+0xa9/0x120
> [ 20.159797] softirqs last disabled at (2051): [<ffffffff81289969>] __irq_exit_rcu+0xa9/0x120
> [ 20.160233] ---[ end trace 0000000000000000 ]---
> "
>
> I hope it's helpful.
>
> Thanks!
>
> ---
>
> If you don't need the following environment to reproduce the problem or if you
> already have one reproduced environment, please ignore the following information.
>
> How to reproduce:
> git clone https://gitlab.com/xupengfe/repro_vm_env.git
> cd repro_vm_env
> tar -xvf repro_vm_env.tar.gz
> cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
> // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
> // You could change the bzImage_xxx as you want
> // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
> You could use below command to log in, there is no password for root.
> ssh -p 10023 root@localhost
>
> After login vm(virtual machine) successfully, you could transfer reproduced
> binary to the vm by below way, and reproduce the problem in vm:
> gcc -pthread -o repro repro.c
> scp -P 10023 repro root@localhost:/root/
>
> Get the bzImage for target kernel:
> Please use target kconfig and copy it to kernel_src/.config
> make olddefconfig
> make -jx bzImage //x should equal or less than cpu num your pc has
>
> Fill the bzImage file into above start3.sh to load the target kernel in vm.
>
> Tips:
> If you already have qemu-system-x86_64, please ignore below info.
> If you want to install qemu v7.1.0 version:
> git clone https://github.com/qemu/qemu.git
> cd qemu
> git checkout -f v7.1.0
> mkdir build
> cd build
> yum install -y ninja-build.x86_64
> yum -y install libslirp-devel.x86_64
> ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
> make
> make install
>
> Best Regards,
> Thanks!
>
>
> On 2024-08-30 at 00:00:55 -0400, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>
> >
> > Prior to call_mmap(), the vmas that will be replaced need to clear the
> > way for what may happen in the call_mmap(). This clean up work includes
> > clearing the ptes and calling the close() vm_ops. Some users do more
> > setup than can be restored by calling the vm_ops open() function. It is
> > safer to store the gap in the vma tree in these cases.
> >
> > That is to say that the failure scenario that existed before the
> > MAP_FIXED gap exposure is restored as it is safer than trying to undo a
> > partial mapping.
> >
> > Since abort_munmap_vmas() is only reattaching vmas with this change, the
> > function is renamed to reattach_vmas().
> >
> > There is also a secondary failure that may occur if there is not enough
> > memory to store the gap. In this case, the vmas are reattached and
> > resources freed. If the system cannot complete the call_mmap() and
> > fails to allocate with GFP_KERNEL, then the system will print a warning
> > about the failure.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> > ---
> > mm/mmap.c | 3 +--
> > mm/vma.c | 4 +--
> > mm/vma.h | 80 ++++++++++++++++++++++++++++++++++++++++---------------
> > 3 files changed, 61 insertions(+), 26 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 405e0432c78e..e1e5c78b6c3c 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1623,8 +1623,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > vm_unacct_memory(charged);
> >
> > abort_munmap:
> > - if (vms.nr_pages)
> > - abort_munmap_vmas(&mas_detach, vms.closed_vm_ops);
> > + vms_abort_munmap_vmas(&vms, &mas_detach);
> > gather_failed:
> > validate_mm(mm);
> > return error;
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 648c58da8ad4..d2d71d659d1e 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -878,7 +878,7 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > munmap_gather_failed:
> > end_split_failed:
> > modify_vma_failed:
> > - abort_munmap_vmas(mas_detach, /* closed = */ false);
> > + reattach_vmas(mas_detach);
> > start_split_failed:
> > map_count_exceeded:
> > return error;
> > @@ -923,7 +923,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > return 0;
> >
> > clear_tree_failed:
> > - abort_munmap_vmas(&mas_detach, /* closed = */ false);
> > + reattach_vmas(&mas_detach);
> > gather_failed:
> > validate_mm(mm);
> > return error;
> > diff --git a/mm/vma.h b/mm/vma.h
> > index 64b44f5a0a11..b2306d13d456 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -82,6 +82,22 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > unsigned long start, unsigned long end, pgoff_t pgoff);
> >
> > +static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> > + struct vm_area_struct *vma, gfp_t gfp)
> > +
> > +{
> > + if (vmi->mas.status != ma_start &&
> > + ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
> > + vma_iter_invalidate(vmi);
> > +
> > + __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> > + mas_store_gfp(&vmi->mas, vma, gfp);
> > + if (unlikely(mas_is_err(&vmi->mas)))
> > + return -ENOMEM;
> > +
> > + return 0;
> > +}
> > +
> > #ifdef CONFIG_MMU
> > /*
> > * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
> > @@ -129,24 +145,60 @@ void vms_clean_up_area(struct vma_munmap_struct *vms,
> > struct ma_state *mas_detach, bool mm_wr_locked);
> >
> > /*
> > - * abort_munmap_vmas - Undo any munmap work and free resources
> > + * reattach_vmas() - Undo any munmap work and free resources
> > + * @mas_detach: The maple state with the detached maple tree
> > *
> > * Reattach any detached vmas and free up the maple tree used to track the vmas.
> > */
> > -static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
> > +static inline void reattach_vmas(struct ma_state *mas_detach)
> > {
> > struct vm_area_struct *vma;
> >
> > mas_set(mas_detach, 0);
> > - mas_for_each(mas_detach, vma, ULONG_MAX) {
> > + mas_for_each(mas_detach, vma, ULONG_MAX)
> > vma_mark_detached(vma, false);
> > - if (closed && vma->vm_ops && vma->vm_ops->open)
> > - vma->vm_ops->open(vma);
> > - }
> >
> > __mt_destroy(mas_detach->tree);
> > }
> >
> > +/*
> > + * vms_abort_munmap_vmas() - Undo as much as possible from an aborted munmap()
> > + * operation.
> > + * @vms: The vma unmap structure
> > + * @mas_detach: The maple state with the detached maple tree
> > + *
> > + * Reattach any detached vmas, free up the maple tree used to track the vmas.
> > + * If that's not possible because the ptes are cleared (and vm_ops->closed() may
> > + * have been called), then a NULL is written over the vmas and the vmas are
> > + * removed (munmap() completed).
> > + */
> > +static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
> > + struct ma_state *mas_detach)
> > +{
> > + struct ma_state *mas = &vms->vmi->mas;
> > + if (!vms->nr_pages)
> > + return;
> > +
> > + if (vms->clear_ptes)
> > + return reattach_vmas(mas_detach);
> > +
> > + /*
> > + * Aborting cannot just call the vm_ops open() because they are often
> > + * not symmetrical and state data has been lost. Resort to the old
> > + * failure method of leaving a gap where the MAP_FIXED mapping failed.
> > + */
> > + mas_set_range(mas, vms->start, vms->end);
> > + if (unlikely(mas_store_gfp(mas, NULL, GFP_KERNEL))) {
> > + pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
> > + current->comm, current->pid);
> > + /* Leaving vmas detached and in-tree may hamper recovery */
> > + reattach_vmas(mas_detach);
> > + } else {
> > + /* Clean up the insertion of the unfortunate gap */
> > + vms_complete_munmap_vmas(vms, mas_detach);
> > + }
> > +}
> > +
> > int
> > do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > struct mm_struct *mm, unsigned long start,
> > @@ -299,22 +351,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi,
> > return mas_prev(&vmi->mas, min);
> > }
> >
> > -static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> > - struct vm_area_struct *vma, gfp_t gfp)
> > -{
> > - if (vmi->mas.status != ma_start &&
> > - ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
> > - vma_iter_invalidate(vmi);
> > -
> > - __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> > - mas_store_gfp(&vmi->mas, vma, gfp);
> > - if (unlikely(mas_is_err(&vmi->mas)))
> > - return -ENOMEM;
> > -
> > - return 0;
> > -}
> > -
> > -
> > /*
> > * These three helpers classifies VMAs for virtual memory accounting.
> > */
> > --
> > 2.43.0
> >

Next message: Barry Song: "Re: [PATCH] binder_alloc: Move alloc_page() out of mmap_rwsem to reduce the lock duration"
Previous message: claudiu beznea: "Re: [PATCH 00/16] Add initial USB support for the Renesas RZ/G3S SoC"
In reply to: Pengfei Xu: "Re: [PATCH v8 15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure"
Next in thread: Lorenzo Stoakes: "Re: [PATCH v8 15/21] mm: Change failure of MAP_FIXED to restoring the gap on failure"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]