RE: [PATCH 0/15] mm: introduce ANON_VMA_LAZY for deferred anon_vma creation

From: wangtao

Date: Wed Jun 03 2026 - 07:15:04 EST


> >
>
> Against my better judgment I'll address the stuff here...
>
> > VMA operations can be roughly divided into three categories. The
> > handling of ANON_VMA_LAZY is briefly described below.
>
> I don't agree, there are plenty more VMA operations. But with respect to
> anon rmap there are:
>
> - fork
> - merge/split
> - remap
>

Yes, those are the three categories. I had intended to classify them by
system call, so I should have written mremap rather than move_vma.

> Your approach seems to completely ignore VMA split and the need to
> maintain an interval tree to _multiple_ VMAs from a single anon_vma.
>

The folio uses vma->root_vma to compute folio_address. A VMA split off
from it, vma_a, has vma_a->root_vma = vma->root_vma and uses it to
compute folio_address in the same way. During rmap, once folio_address
is known, the VMA can be looked up through mm_mt. Without fork, there is
no need to maintain the interval tree.

> You may also actually split a VMA against a single large folio (waiting on the
> deferred shrinker) and have a SINGLE _leaf_ anonymous folio that is mapped
> in two places.
>
> The lazy approach doesn't seem to address this properly. And fatally it ties an
> actual VMA afaict to the folio and has to implement a VMA reference count
> mechanism which interferes with the ordinarily VMA lifecycle to do it.
>
> The fact of us taking advantage of most stuff being AnonExclusive, i.e.
> 'leaves' is something that my approach is exactly taking into account.
>
> Of course also extending anon_vma is a real non-starter.
>
> Also the below + the series ignores MAP_PRIVATE file-backed mappings
> which is a pretty fatal flaw.
>
> It also, as Harry says, has zero description of correctness in a way we'd want
> and no tests.
>

It can correctly handle the case where a VMA is split in the middle of a
large folio. The address of a sub_page in either split VMA (vma_a or
vma_b) is computed as shown below.

For COW anonymous pages originating from file VMAs, the page/folio
address is computed in the same way.

subpage_address = vma_address(vma_a, subpage_pgoff, 1)
    = vma_a->vm_start + (subpage_pgoff - vma_a->vm_pgoff) * PAGE_SIZE
    = vma_a->vm_start - vma_a->vm_pgoff * PAGE_SIZE + subpage_pgoff * PAGE_SIZE
    = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE
    = vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE
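
To make the identity concrete, here is a small userspace sketch. It is
only an illustration under simplifying assumptions: toy_vma,
toy_mapping_base(), toy_vma_address() and toy_split() are hypothetical
names, not kernel code and not code from the series, and the range
checks done by the real vma_address() are omitted.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12UL
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Hypothetical, simplified stand-in for struct vm_area_struct. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

/* vm_start - vm_pgoff * PAGE_SIZE; unsigned wrap-around is harmless here. */
static unsigned long toy_mapping_base(const struct toy_vma *vma)
{
	return vma->vm_start - (vma->vm_pgoff << TOY_PAGE_SHIFT);
}

/* Same arithmetic as the derivation above, for a single page. */
static unsigned long toy_vma_address(const struct toy_vma *vma,
				     unsigned long pgoff)
{
	return vma->vm_start + ((pgoff - vma->vm_pgoff) << TOY_PAGE_SHIFT);
}

/*
 * Split vma at addr: vma keeps [vm_start, addr) and the returned VMA
 * covers [addr, vm_end), with vm_pgoff advanced by the pages skipped.
 */
static struct toy_vma toy_split(struct toy_vma *vma, unsigned long addr)
{
	struct toy_vma new_vma = *vma;

	assert(addr > vma->vm_start && addr < vma->vm_end);
	assert((addr & (TOY_PAGE_SIZE - 1)) == 0);

	new_vma.vm_start = addr;
	new_vma.vm_pgoff += (addr - vma->vm_start) >> TOY_PAGE_SHIFT;
	vma->vm_end = addr;
	return new_vma;
}
```

Because toy_split() advances vm_pgoff by exactly the number of pages
that vm_start moved, both halves keep the base of the original VMA.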

> >
> > 1. fork
> >
> > fork duplicates the parent's mm/mmap. (exec creates a new mm/mmap and
> > is not involved here.) This can be viewed as copying the VMAs with
> > identical virtual addresses into a new address space.
> >
> > If the parent VMA (pvma) is ANON_VMA_LAZY, it is first upgraded to a
> > regular anon_vma. The corresponding folio->mapping is then fixed in
> > try_dup_anon_rmap().
>
> And so we make fork, a very sensitive path in the kernel more expensive.
>
> I also question the locking situation with the conversion mentioned, updating
> folios in this manner is extremely difficult.
>

rmap takes the PTE lock, while fork takes the mmap write lock, the VMA
write lock, and the PTE lock.

Given the rule that folio->mapping can only transition in one direction,
from lazy_vma to a regular anon_vma, the situation can be handled
correctly even without taking the folio_lock.

When rmap and fork run concurrently:
- If rmap observes folio->mapping as a regular anon_vma, there is
  obviously no issue.
- If rmap observes folio->mapping as lazy_vma, then rmap only processes
  the parent's pvma. If, at the end of rmap_walk_anon(), folio->mapping
  has changed to a regular anon_vma, we simply walk once more. The
  various rmap_one implementations are idempotent anyway.

BTW: the commit message of patch 13 says a retry is needed, but the
retry handling was accidentally omitted from the posted patch.
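
The intended check-and-retry control flow can be sketched in userspace
as follows. This is a toy model and not the missing hunk from patch 13:
toy_folio, toy_rmap_walk_anon() and toy_walk_racing_fork() are
hypothetical names, the C11 atomics only stand in for whatever ordering
the real code would need, and the locking described above is not
modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>

enum toy_mapping_kind { TOY_LAZY_VMA, TOY_ANON_VMA };

/* Hypothetical stand-in for the part of a folio that matters here. */
struct toy_folio {
	/* Only ever changes from TOY_LAZY_VMA to TOY_ANON_VMA. */
	_Atomic int mapping_kind;
};

typedef void (*toy_walk_fn)(struct toy_folio *folio, int kind, void *arg);

/*
 * Walk once using the mapping kind observed at the start. If the walk
 * started on a lazy mapping and the mapping was upgraded meanwhile,
 * walk again. Returns the number of passes made.
 */
static int toy_rmap_walk_anon(struct toy_folio *folio, toy_walk_fn walk,
			      void *arg)
{
	int passes = 0;
	int seen, now;

	do {
		seen = atomic_load(&folio->mapping_kind);
		walk(folio, seen, arg);	/* assumed idempotent */
		passes++;
		now = atomic_load(&folio->mapping_kind);
	} while (seen == TOY_LAZY_VMA && now == TOY_ANON_VMA);

	/* The transition is one-way, so a second pass cannot retry again. */
	assert(passes <= 2);
	return passes;
}

/* Demo callback: pretends a fork upgrades the mapping during the walk. */
static void toy_walk_racing_fork(struct toy_folio *folio, int kind, void *arg)
{
	int *calls = arg;

	(void)kind;
	(*calls)++;
	atomic_store(&folio->mapping_kind, TOY_ANON_VMA);
}
```

In this model the loop is bounded only because the upgrade is one-way;
if the mapping could ever go back to lazy, the bound would not hold.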

> >
> > 2. mmap / brk / mprotect / munmap
> >
> > These operations create, modify, or remove VMAs in the current mm.
> > They may split existing VMAs, merge adjacent VMAs, or remove a VMA
> > from mm_mt.
>
> mmap and brk are not at all relevant to anon_vma, as no anon_vma is
> assigned upon mapping. It's on fault.
>
mmap()/brk() at a specified address may cause anonymous VMAs to be
merged or split.

> mprotect/mlock/munmap/etc. might split, but I don't see how the lazy
> approach in any way addresses any of that.
>
As mentioned above, after a split, rmap still uses root_vma to compute
folio_address or page_address.

> >
> > When a new VMA is created, vm_start, vm_end and vm_pgoff are
> > initialized and the VMA is inserted into mm_mt. Although these fields
> > may later be modified, the following value remains invariant:
> >
> > (vm_start - vm_pgoff * PAGE_SIZE)
>
> Err no it doesn't at all?
>
> If I fault in a VMA at vm_start, vm_pgoff = vm_start >> PAGE_SHIFT.
>
> Then if I remap it, vm_start changes, vm_pgoff stays the same, so:
>
> vm_start - vm_pgoff * PAGE_SIZE
>
> Changes right? And then that becomes essentially the offset from where it
> was faulted in.
>
If mremap changes vm_start, i.e. the move_vma case, a new VMA is
created. That is exactly the third category described later:
anon_vma_lazy is upgraded to a regular anon_vma and folio->mapping is
updated.

> >
> > We refer to this value as:
> >
> > vma_mapping_base(vma) = vma->vm_start - vma->vm_pgoff * PAGE_SIZE
>
> This is mysteriously close to being the offset I mention in my CoW context
> work...
>
> I'm not sure what 'mapping base' means here.
>

vma_address(vma, pgoff, nr_pages)
    = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
    = vma->vm_start + ((pgoff - vma->vm_pgoff) * PAGE_SIZE)
    = vma->vm_start - vma->vm_pgoff * PAGE_SIZE + pgoff * PAGE_SIZE
    = vma_mapping_base(vma) + pgoff * PAGE_SIZE

vma_mapping_base depends only on the VMA and is independent of the page.
Alternatively, we could call it vma_rmap_base.
> >
> > This value also remains unchanged when the VMA is removed from
> > mm_mt.
>
> Why does it matter what this value is on unmap?
>
If root_vma is removed from mm_mt by munmap, it remains valid for as
long as other VMAs hold references to it.

> >
> > If a VMA is split and produces new_vma, the following holds:
> >
> > vma_mapping_base(new_vma) == vma_mapping_base(vma)
>
> This is a roundabout way of saying we offset the vma->vm_pgoff after split.
>
> >
> > If two adjacent VMAs vma_a and vma_b are merged into vma_x, then:
> >
> > vma_mapping_base(vma_a) == vma_mapping_base(vma_b) ==
> > vma_mapping_base(vma_x)
>
> This is just a roundabout way of saying the pgoff has to be aligned.
>
> >
> > Assume the VMA where the first page fault occurs is called root_vma,
> > and ensure that any VMA produced by split or merge holds a reference
> > to root_vma.
>
> But this VMA can be unmapped later? Or remapped?
>
It can be unmapped. As mentioned earlier, if mremap changes vm_start,
a new VMA is created.


> Holding on to a VMA and treating it as some kind of canonical reference with
> a reference count completely changes what VMAs are, impacts the VMA
> lifecycle, and produces unwanted memory overhead in itself.
>
During split/merge we can prefer to keep root_vma as the surviving VMA
wherever possible, so that it is not deleted.

> It also raises concerns and issues around lock order which is very sensitive.
>
Both rmap and fork acquire the PTE lock, which ensures that processing a
page with respect to a particular VMA is atomic.

There is no need to add folio_lock. When fork converts folio->mapping
into a regular anon_vma, rmap_walk_anon() can simply check and retry.


> >
> > During rmap we can compute the folio address using root_vma:
> >
> > vma_address(vma, pgoff, 1) =
>
> What's the parameters here? What's 1?
>
> > vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
> > = vma_mapping_base(vma) + pgoff * PAGE_SIZE
> > = vma_mapping_base(root_vma) + folio_pgoff * PAGE_SIZE
> >
> > We can then use folio_addr to locate the VMA covering this folio.
>

I overlooked this earlier. We can unify it by using pgoff throughout:

page_addr = vma_address(vma, pgoff, nr_pages)
    = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT)
    = vma->vm_start + ((pgoff - vma->vm_pgoff) * PAGE_SIZE)
    = vma->vm_start - vma->vm_pgoff * PAGE_SIZE + pgoff * PAGE_SIZE
    = vma_mapping_base(vma) + pgoff * PAGE_SIZE
    = vma_mapping_base(root_vma) + pgoff * PAGE_SIZE
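
As a rough illustration of how the address computed from root_vma would
then be used to find the covering VMA, here is a userspace sketch. All
names (toy_vma, toy_find_vma(), toy_rmap_lookup()) are hypothetical, and
a linear scan over an array stands in for the mm_mt lookup; it is not
code from the series.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12UL

/* Hypothetical, simplified stand-in for struct vm_area_struct. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

static unsigned long toy_mapping_base(const struct toy_vma *vma)
{
	return vma->vm_start - (vma->vm_pgoff << TOY_PAGE_SHIFT);
}

/* Linear scan standing in for a lookup in mm_mt. */
static const struct toy_vma *toy_find_vma(const struct toy_vma *vmas,
					  size_t nr, unsigned long addr)
{
	for (size_t i = 0; i < nr; i++)
		if (addr >= vmas[i].vm_start && addr < vmas[i].vm_end)
			return &vmas[i];
	return NULL;
}

/*
 * Compute the address for pgoff from root_vma's base alone, then look
 * up whichever VMA currently covers that address (NULL if none does).
 */
static const struct toy_vma *toy_rmap_lookup(const struct toy_vma *root_vma,
					     const struct toy_vma *vmas,
					     size_t nr, unsigned long pgoff,
					     unsigned long *addr)
{
	assert(root_vma != NULL && addr != NULL);

	*addr = toy_mapping_base(root_vma) + (pgoff << TOY_PAGE_SHIFT);
	return toy_find_vma(vmas, nr, *addr);
}
```

Note that root_vma is used here only for its base; its own vm_start and
vm_end do not have to cover the address being resolved.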


> I'm really confused by this, you're kind of mixing and match parameters here.
>
> What I think you're saying is that, if a folio hasn't been remapped, you can
> figure out its address based on page offset.
>
> That's completely broken for MAP_PRIVATE file-backed mappings which also
> use anon_vma and also have to keep on working.
>
> It seems that for the lazy approach what you are doing is essentially caching
> the 'root' VMA in the folio. But this doesn't account for large folios and split
> VMAs.
>
As mentioned earlier:

subpage_address = vma_address(vma_a, subpage_pgoff, 1)
    = vma_mapping_base(vma_a) + subpage_pgoff * PAGE_SIZE
    = vma_mapping_base(root_vma) + subpage_pgoff * PAGE_SIZE

> Even if you disabled it for those cases (which adds a ton of complexity in
> itself), you then have issues with locking - the anon_vma lock has to take a
> lock (that cannot be a VMA-level lock - results in lock inversion) even on
> these leaf entries, or you break locking.
>
Without fork/mremap, we need neither the interval tree nor the anon_vma
lock.

> And we can't reasonably start pinning VMAs and using them as a sort of
> proto cached thing on top of the existing anon_vma logic.
>

In most cases root_vma is still in use. It may be removed by munmap, but
overall the approach still saves memory.

> You also then need to, on remap, undo all this, which requires updating
> folio->mapping on remap, something I tried doing previously myself, but
> that's fraught with issues around lock inversion itself.
>
> >
> > 3. mremap / uffd_move
>
> userfaultfd moving is not relevant as it actually updates the folio correctly.
>
These two operations differ from the previous two categories in that
they change the virtual address of the page/folio.

> >
> > If only the size changes and the start address remains the same, there
> > is no impact.
> >
> > If the start address changes, the page is moved from (vma, addr) to
> > (new_vma, new_addr). In this case:
> >
> > vma_mapping_base(new_vma) =
> > vma_mapping_base(vma) + new_addr - old_addr
>
> You say above that the mapping base never changes? But here it changes?
>

For the newly created new_vma, vma_mapping_base(new_vma) is not equal to
vma_mapping_base(vma), while vma_mapping_base(vma) itself remains
unchanged.
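
A small userspace sketch of this point, assuming that a move creates a
new VMA of the same length with an unchanged vm_pgoff. toy_vma,
toy_mapping_base() and toy_move_vma() are hypothetical names used only
for illustration, not the kernel's move_vma().

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12UL
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Hypothetical, simplified stand-in for struct vm_area_struct. */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
};

static unsigned long toy_mapping_base(const struct toy_vma *vma)
{
	return vma->vm_start - (vma->vm_pgoff << TOY_PAGE_SHIFT);
}

/*
 * Model of a move: the old VMA is left untouched and a new VMA of the
 * same length is created at new_addr, keeping vm_pgoff as it was.
 */
static struct toy_vma toy_move_vma(const struct toy_vma *vma,
				   unsigned long new_addr)
{
	struct toy_vma new_vma = *vma;

	assert((new_addr & (TOY_PAGE_SIZE - 1)) == 0);

	new_vma.vm_start = new_addr;
	new_vma.vm_end = new_addr + (vma->vm_end - vma->vm_start);
	return new_vma;
}
```

With vm_pgoff unchanged, the base of new_vma differs from the base of
vma by exactly new_addr - old_addr, which is why the folios moved into
new_vma cannot keep using the old root_vma for address computation.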

> >
> > We first upgrade the VMA, and then fix folio->mapping in move_ptes().
>
> What's 'upgrading' a VMA? You mean converting the lazy anon_vma to a
> 'normal' one.
>
> As above, this is fraught with lock inversion issues.
>
Yes, it means upgrading from a lazy_vma to a regular anon_vma.
As mentioned earlier, during this process we hold the mmap write lock,
the VMA write lock, and the PTE lock, so taking the folio_lock is
unnecessary.

> >
> > If performance becomes a concern, ANON_VMA_LAZY can be enabled only
> > for relatively small VMAs.
>
> I think you've got serious correctness, lock management and complexity
> issues and it's all a non-starter as the costs deeply exceed the benefits.
>

I think the approach is feasible:

1. During merge/split, the newly created vma_a satisfies
   vma_mapping_base(vma_a) == vma_mapping_base(vma) ==
   vma_mapping_base(root_vma). Therefore, we can use root_vma to
   compute the virtual address of the folio/page mapped by vma_a.

2. During fork and mremap, we hold the mmap write lock, the VMA
   write lock, and the PTE lock. In particular, the PTE lock ensures
   that rmap and fork operations on a folio/page within a specific
   VMA are atomic. If folio->mapping is upgraded during
   rmap_walk_anon(folio), we can simply let rmap_walk_anon() retry
   once.

> This is one of the fundamental, frustrating aspects of the anon rmap - you
> keep thinking that 'surely' you can do sensible thing X, but it turns out you
> can't for various annoying reasons.
>
> It's one of the reasons it's really fraught for somebody coming to make
> changes, and one of the reasons why I am very keen on fundamentally
> changing it.
>
> And also on a not-wasting-time basis - I was already working in parallel on a
> rework here, so I think the civil thing is to at least wait for my work before
> issuing alternative solutions.
>
> Thanks, Lorenzo
>