Re: [PATCH] mm/vma: Fix hugetlb accounting error in copy_vma()

From: Fedor Pchelkin
Date: Mon Jan 27 2025 - 11:45:48 EST


+Cc Mina Almasry and Mike Kravetz

On Mon, 27. Jan 14:46, Lorenzo Stoakes wrote:
> Thanks for the report.
>
> On Mon, Jan 27, 2025 at 05:32:01PM +0300, Daniil Dulov wrote:
> > In copy_vma() allocation of maple tree nodes may fail. Since page accounting
> > takes place at the close() operation for hugetlb, it is called at the error
> > path against the new_vma to account pages of the vma that was not successfully
> > copied and that shares the page_counter with the original vma. Then, when the
> > process is being terminated, vm_ops->close() is called once again against the
> > original vma, which results in a page_counter underflow.
>
> This seems like a bug in hugetlb.
>
> I really hate the solution here, it's hacky and assumes only these fields are
> meaningful for 'close twice' scenarios.
>
> We now use vma_close(), which assigns vma->vm_ops to vma_dummy_vm_ops, meaning
> no further close() invocations can occur.

Does the "close twice" scenario exactly mean ->close() is called twice
for the same object of struct vm_area_struct?

For the observed case I think that's not true. ->close() is called for
two different objects of type vm_area_struct - the first time for the
new_vma on error path of copy_vma(), the second time for the original
vma. It turns out then these objects share the same reservation map
holding page_counters at this point of time.

>
> If hugetlb is _still_ choosing to internally invoke this, it seems like it
> should have some if (vma->vm_ops == hugetlb_vm_ops) { ... } check first? That
> way it'll account for the closing twice issue.
>
> Can you easily repro in order to check a solution like that fixes your problem?
> I don't see why it shouldn't
>

Seems that wouldn't fix the problem (again, two different vma objects).
There's presumably no obvious place in hugetlb internals where this may
be fixed elegantly, at the quick glance. But yep, it does look like a bug
for hugetlb to care about.. Perhaps somehow defer the reservation map
copying?

> >
> > page_counter underflow: -1024 nr_pages=1024
> > WARNING: CPU: 1 PID: 1086 at mm/page_counter.c:55 page_counter_cancel+0xd6/0x130 mm/page_counter.c:55
> > Modules linked in:
> > CPU: 1 PID: 1086 Comm: syz-executor200 Not tainted 6.1.108-syzkaller-00078-g9ce77c16947b #0
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > Call Trace:
> > <TASK>
> > page_counter_uncharge+0x2e/0x70 mm/page_counter.c:158
> > hugetlb_cgroup_uncharge_counter+0xd2/0x420 mm/hugetlb_cgroup.c:430
> > hugetlb_vm_op_close+0x435/0x700 mm/hugetlb.c:4886
> > remove_vma+0x84/0x130 mm/mmap.c:140
> > exit_mmap+0x32f/0x7a0 mm/mmap.c:3249
> > __mmput+0x11e/0x430 kernel/fork.c:1199
> > mmput+0x61/0x70 kernel/fork.c:1221
> > exit_mm kernel/exit.c:565 [inline]
> > do_exit+0xa4a/0x2790 kernel/exit.c:858
> > do_group_exit+0xd0/0x2a0 kernel/exit.c:1021
> > __do_sys_exit_group kernel/exit.c:1032 [inline]
> > __se_sys_exit_group kernel/exit.c:1030 [inline]
> > __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:1030
> > do_syscall_x64 arch/x86/entry/common.c:51 [inline]
> > do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
> > entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > </TASK>
>
> >
> > Since there is no sense in vm accounting for a bad copy of vma, set vm_start
> > to be equal vm_end and vm_pgoff to be equal 0. Previously, a similar issue
> > has been fixed in __split_vma() in the same way [1].
> >
> > [1]: https://lore.kernel.org/all/20220719201523.3561958-1-Liam.Howlett@xxxxxxxxxx/T/
>
> Understood that we do this elsewhere, I think equally we should not do this
> there either! :)
>
> >
> > Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
> >
> > Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
> > Cc: stable@xxxxxxxxxxxxxxx
> > Signed-off-by: Daniil Dulov <d.dulov@xxxxxxxxxx>
> > ---
> > mm/vma.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index bb2119e5a0d0..dbc68b7cd0ec 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -1772,6 +1772,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> > return new_vma;
> >
> > out_vma_link:
> > + /* Avoid vm accounting in close() operation */
> > + new_vma->vm_start = new_vma->vm_end;
> > + new_vma->vm_pgoff = 0;
> > vma_close(new_vma);
> >
> > if (new_vma->vm_file)
> > --
> > 2.34.1
> >