Re: [RFC PATCH] mm: batch link_file_vma calls in dup_mmap
From: Pedro Falcato
Date: Tue Jun 16 2026 - 06:08:17 EST
On Tue, Jun 16, 2026 at 05:13:02PM +0800, Yibin Liu wrote:
> Forking a process with many file-backed mappings sharing the same
> file (e.g. a dynamically linked binary with several mappings into
> the same shared library) repeatedly acquires and releases the
> mapping i_mmap_rwsem in dup_mmap(), once per vma, as each vma is
> inserted into the address_space interval tree.
>
> Mirror the unlink_file_vma_batch mechanism added for free_pgd_range()
> by commit 3577dbb19241 ("mm: batch unlink_file_vma calls in
> free_pgd_range") and apply the same idea on the vma creation side:
> introduce link_vma_file_batch, which gathers consecutive vmas backed
> by the same file and inserts them into the interval tree under a
> single i_mmap_lock_write()/i_mmap_unlock_write() pair instead of one
> pair per vma.
>
> Unlike the unlink side, vma_interval_tree_insert_after() needs both
> the new vma and the vma it is inserted after, so the batch keeps a
> parallel old_vmas[] array alongside new_vmas[] rather than the single
> vmas[] array used by unlink_vma_file_batch.
>
> link_file_vma_batch_add() is wired into dup_mmap()'s vma copy loop in
> place of the inline i_mmap_lock_write()/vma_interval_tree_insert_after()
> sequence, and link_file_vma_batch_final() flushes any pending batch
> both on the successful loop exit and on every error path that jumps
> to loop_out, so the interval tree is never left out of sync with the
> vmas already linked into the maple tree.
>
> Tested with the same doexec benchmark used by 3577dbb19241:
> http://apollo.backplane.com/DFlyMisc/doexec.c
>
> $ cc -O2 -o shared-doexec doexec.c
> $ ./shared-doexec $(nproc)
>
> Run on an AMD EPYC 9754 with 512 threads, execs per second improved
> by roughly 2%-7% over the unpatched kernel across repeated runs.
>
> Signed-off-by: Yibin Liu <liuyibin@xxxxxxxx>
> ---
> mm/mmap.c | 14 ++++----------
> mm/vma.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
> mm/vma.h | 14 ++++++++++++++
> 3 files changed, 67 insertions(+), 10 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2311ae7c2..d5a4312df 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1735,6 +1735,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> unsigned long charge = 0;
> LIST_HEAD(uf);
> VMA_ITERATOR(vmi, mm, 0);
> + struct link_vma_file_batch vb;
>
> if (mmap_write_lock_killable(oldmm))
> return -EINTR;
> @@ -1758,6 +1759,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> if (unlikely(retval))
> goto out;
>
> + link_file_vma_batch_init(&vb);
> mt_clear_in_rcu(vmi.mas.tree);
> for_each_vma(vmi, mpnt) {
> struct file *file;
> @@ -1822,18 +1824,9 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>
> file = tmp->vm_file;
> if (file) {
> - struct address_space *mapping = file->f_mapping;
> -
> get_file(file);
> - i_mmap_lock_write(mapping);
> - if (vma_is_shared_maywrite(tmp))
> - mapping_allow_writable(mapping);
> - flush_dcache_mmap_lock(mapping);
> /* insert tmp into the share list, just after mpnt */
> - vma_interval_tree_insert_after(tmp, mpnt,
> - &mapping->i_mmap);
> - flush_dcache_mmap_unlock(mapping);
> - i_mmap_unlock_write(mapping);
> + link_file_vma_batch_add(&vb, tmp, mpnt);
This does not work: it introduces subtle races between rmap and fork().
Consider this sequence:
1) we dup VMA A
2) we copy over the pages
3) concurrently, someone does an rmap walk for the file
4) finally, insert the VMAs into the interval tree
The rmap walk will not find every mapping of the folio it's looking at (we
haven't inserted the new VMAs yet), and it will be very confused.
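
To make the ordering concrete, here is a toy userspace model of the four
steps above. It is only a sketch: toy_mapping, toy_folio, toy_link(),
toy_rmap_walk() and toy_fork() are hypothetical names made up for this
illustration, not kernel structures or functions, and the "concurrent" rmap
walk is simply placed at the point in the sequence where it would hurt.

```c
#include <assert.h>

#define MAX_VMAS 8

/* Toy stand-in for the address_space interval tree (hypothetical). */
struct toy_mapping {
	int linked[MAX_VMAS];	/* ids of VMAs an rmap walk can see */
	int nr_linked;
};

/* Toy stand-in for a folio (hypothetical). */
struct toy_folio {
	int mapcount;		/* number of address spaces mapping it */
};

/* Make a VMA visible to rmap walks. */
static void toy_link(struct toy_mapping *m, int vma_id)
{
	assert(m->nr_linked < MAX_VMAS);
	m->linked[m->nr_linked++] = vma_id;
}

/* An rmap walk can only visit VMAs that are already linked. */
static int toy_rmap_walk(const struct toy_mapping *m)
{
	return m->nr_linked;
}

/*
 * Model one fork of a process with a single file-backed VMA A.
 * Returns how many mappings of the folio the rmap walk missed.
 */
static int toy_fork(int batched)
{
	struct toy_mapping m = { .nr_linked = 0 };
	struct toy_folio folio = { .mapcount = 0 };
	int found;

	/* Parent: VMA A (id 0) is linked and maps the folio. */
	toy_link(&m, 0);
	folio.mapcount++;

	/* 1) dup VMA A; the unbatched path links the copy right away. */
	if (!batched)
		toy_link(&m, 1);

	/* 2) copy over the pages: the child now maps the folio too. */
	folio.mapcount++;

	/* 3) an rmap walk for the file runs at this point. */
	found = toy_rmap_walk(&m);

	/* 4) the batched path only links the copy when it flushes. */
	if (batched)
		toy_link(&m, 1);

	return folio.mapcount - found;
}
```

In this model toy_fork(0) returns 0 and toy_fork(1) returns 1: with the
insertion deferred to the flush, the walk in step 3 sees one mapping fewer
than the folio actually has.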
--
Pedro