Re: [PATCH v3 9/9] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()

From: Liam R. Howlett
Date: Fri Oct 06 2023 - 21:35:12 EST


* Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> [231006 21:11]:
> * Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [231005 11:56]:
> >
> >
> > On 2023/10/5 03:53, Liam R. Howlett wrote:
> > > * Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [231004 05:10]:
> > > >
> > > >
> > > > On 2023/10/4 02:46, Liam R. Howlett wrote:
> > > > > * Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [230924 23:58]:
> > > > > > In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
> > > > > > directly replacing the entries of VMAs in the new maple tree can result
> > > > > > in better performance. __mt_dup() uses DFS pre-order to duplicate the
> > > > > > maple tree, so it is very efficient. The average time complexity of
> > > > > > duplicating VMAs is reduced from O(n * log(n)) to O(n). The optimization
> > > > > > effect is proportional to the number of VMAs.
> > > > >
> > > > > I am not confident in the big O calculations here. Although the cost of
> > > > > adding to the tree is reduced, adding a VMA still needs to create the
> > > > > nodes above it - which are a function of n. How did you get O(n * log(n))
> > > > > for the existing fork?
> > > > >
> > > > > I would think your new algorithm is n * log(n/16), while the
> > > > > previous was n * log(n/16) * f(n). Where f(n) would be something
> > > > > to do with the decision to split/rebalance in bulk insert mode.
> > > > >
> > > > > It's certainly a better algorithm to duplicate trees, but I don't think
> > > > > it is O(n). Can you please explain?
> > > >
> > > > The following is a non-professional analysis of the algorithm.
> > > >
> > > > Let's first analyze the average time complexity of the new algorithm, as
> > > > it is relatively easy to analyze. The maximum number of branches for
> > > > internal nodes in a maple tree in allocation mode is 10. However, to
> > > > simplify the analysis, we will not consider this case and assume that
> > > > all nodes have a maximum of 16 branches.
> > > >
> > > > The new algorithm assumes that no VMA with the VM_DONTCOPY flag needs
> > > > to be deleted. If such deletions occur, this analysis does not apply.
> > > >
> > > > The operations of the new algorithm consist of three parts:
> > > >
> > > > 1. DFS traversal of each node in the source tree
> > > > 2. For each node in the source tree, create a copy and construct a new
> > > > node
> > > > 3. Traverse the new tree using mas_find() and replace each element
> > > >
> > > > If there are a total of n elements in the maple tree, there are n/16
> > > > leaf nodes. The level above the leaves has n/16^2 nodes, and so on up
> > > > to the root. The total number of nodes in the entire tree is the sum
> > > > n/16 + n/16^2 + n/16^3 + ... + 1, a geometric progression with
> > > > log base 16 of n terms. By the formula for the sum of a geometric
> > > > progression, the sum is (n-1)/15. So this tree has a total of
> > > > (n-1)/15 nodes and (n-1)/15 - 1 edges.
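> > > >
> > > > As a quick sanity check of this formula (my own worked example, not
> > > > from the benchmark): a perfectly full tree with n = 16^3 = 4096
> > > > elements has 256 leaf nodes, 16 internal nodes and 1 root, i.e.
> > > > 273 nodes, which matches (4096 - 1)/15 = 273.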
> > > >
> > > > For the operations in the first part of this algorithm, since DFS
> > > > traverses each edge twice, the time complexity would be
> > > > 2*((n-1)/15 - 1).
> > > >
> > > > For the second part, each operation involves copying a node and making
> > > > necessary modifications. Therefore, the time complexity is
> > > > 16*(n-1)/15.
> > > >
> > > > For the third part, we use mas_find() to traverse and replace each
> > > > element, which is essentially similar to the combination of the first
> > > > and second parts. mas_find() traverses all nodes and within each node,
> > > > it iterates over all elements and performs replacements. The time
> > > > complexity of traversing the nodes is 2*((n-1)/15 - 1), and for all
> > > > nodes, the time complexity of replacing all their elements is
> > > > 16*(n-1)/15.
> > > >
> > > > By ignoring all constant factors, each of the three parts of the
> > > > algorithm has a time complexity of O(n). Therefore, this new algorithm
> > > > is O(n).
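> > > >
> > > > To make the shape of the work concrete, below is a minimal userspace
> > > > sketch of the DFS pre-order duplication. It is not the real
> > > > __mt_dup(): the node layout and helpers are invented for
> > > > illustration, and unwinding on a failed child allocation is omitted.
> > > >
> > > > #include <stdlib.h>
> > > >
> > > > struct node {
> > > > 	int nr;				/* children in use, at most 16 */
> > > > 	struct node *child[16];
> > > > };
> > > >
> > > > static struct node *dup_subtree(const struct node *src)
> > > > {
> > > > 	struct node *dst = malloc(sizeof(*dst));
> > > >
> > > > 	if (!dst)
> > > > 		return NULL;
> > > > 	*dst = *src;			/* one bounded-size copy per node */
> > > > 	for (int i = 0; i < src->nr; i++)	/* one descent per edge */
> > > > 		dst->child[i] = dup_subtree(src->child[i]);
> > > > 	return dst;
> > > > }
> > > >
> > > > Each node is copied once and each edge is walked once down and once
> > > > back up, so with (n-1)/15 nodes the whole duplication is O(n).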
> > >
> > > Thanks for the detailed analysis! I didn't mean to cause so much work
> > > with this question. I wanted to know so that future work could rely on
> > > this calculation to demonstrate if it is worth implementing without
> > > going through the effort of coding and benchmarking - after all, this
> > > commit message will most likely be examined during that process.
> > >
> > > I asked because O(n) vs O(n*log(n)) doesn't seem to fit with your
> > > benchmarking.
> > It may not be well reflected in the benchmarking of fork() because the
> > time complexity analysis above only covers the maple tree part,
> > specifically the cost of constructing the new maple tree. fork() also
> > does a lot of other work.
>
> The fork is allocating VMAs, etc, but that is all a 1-1 mapping per
> VMA, so it should be linear, if not near-linear. There is some setup
> time involved with the mm struct too, but that should matter less as
> more VMAs are added per fork.
>
> > >
> > > >
> > > > The exact time complexity of the old algorithm is difficult to analyze.
> > > > I can only provide an upper-bound estimate. There are two possible
> > > > scenarios for each insertion:
> > > >
> > > > 1. Appending at the end of a node.
> > > > 2. Splitting nodes multiple times.
> > > >
> > > > For the first scenario, the individual operation has a time complexity
> > > > of O(1). As for the second scenario, it involves node splitting. The
> > > > challenge lies in determining which insertions trigger splits and how
> > > > many splits occur each time, which is difficult to calculate. In the
> > > > worst case, an insertion splits nodes all the way up the tree, i.e.
> > > > log(n) splits. Assuming every insertion hits this worst case, the
> > > > time complexity would be n*log(n). However, not every insertion
> > > > requires splitting, and the number of splits each time is not
> > > > necessarily log(n). Therefore, this is an upper-bound estimate.
> > >
> > > Saying every insert causes a split and arriving at n*log(n) is more
> > > than an over-estimate. A leaf can only overflow once per 16 or so
> > > appends, and each split cascades at most the height of the tree, so
> > > at worst there is some n + n/16 * log(n) going on there.
> > >
> > > During the building of a tree, we are in bulk insert mode. This favours
> > > balancing the tree to the left to maximize the number of inserts being
> > > append operations. The algorithm inserts as many to the left as we can
> > > leaving the minimum number on the right.
> > >
> > > We also reduce the number of splits by pushing data to the left whenever
> > > possible, at every level.
> > Yes, but I don't think pushing data would occur when inserting in
> > ascending order in bulk mode because the left nodes are all full, while
> > there are no nodes on the right side. However, I'm not entirely certain
> > about this since I only briefly looked at the implementation of this
> > part.
>
> They are not full; the right node keeps just enough entries to remain
> a sufficient node, so the left node has that many free slots for a
> push. mab_calc_split():
>
> 	/* In bulk mode, leave only the minimum on the right. */
> 	if (unlikely((mas->mas_flags & MA_STATE_BULK))) {
> 		*mid_split = 0;
> 		split = b_end - mt_min_slots[bn->type];
>
> > >
> > >
> > > > >
> > > > > >
> > > > > > As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
> > > > > > fails, there will be a portion of VMAs that have not been duplicated in
> > > > > > the maple tree. This makes it impossible to unmap all VMAs in exit_mmap().
> > > > > > To solve this problem, undo_dup_mmap() is introduced to handle the failure
> > > > > > of dup_mmap(). I have carefully tested the failure path and so far it
> > > > > > seems there are no issues.
> > > > > >
> > > > > > There is a "spawn" test in byte-unixbench[1] which can be used to test
> > > > > > the performance of fork(). I modified it slightly to make it work with
> > > > > > different numbers of VMAs.
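> > > > > >
> > > > > > For reference, here is a minimal sketch of such a harness - not the
> > > > > > actual byte-unixbench code; the mapping strategy and sizes are my
> > > > > > own assumptions. It maps N extra single-page VMAs, alternating
> > > > > > protections so neighbouring mappings cannot merge, then counts
> > > > > > fork()+wait() pairs completed in ten seconds:
> > > > > >
> > > > > > #include <stdio.h>
> > > > > > #include <stdlib.h>
> > > > > > #include <sys/mman.h>
> > > > > > #include <sys/wait.h>
> > > > > > #include <time.h>
> > > > > > #include <unistd.h>
> > > > > >
> > > > > > int main(int argc, char **argv)
> > > > > > {
> > > > > > 	long n = argc > 1 ? atol(argv[1]) : 0;	/* extra VMAs */
> > > > > > 	long forks = 0;
> > > > > > 	time_t end;
> > > > > >
> > > > > > 	for (long i = 0; i < n; i++)	/* alternate prot to avoid merging */
> > > > > > 		mmap(NULL, 4096, i & 1 ? PROT_READ : PROT_READ | PROT_WRITE,
> > > > > > 		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > > > >
> > > > > > 	end = time(NULL) + 10;
> > > > > > 	while (time(NULL) < end) {
> > > > > > 		pid_t pid = fork();
> > > > > >
> > > > > > 		if (pid < 0)
> > > > > > 			break;
> > > > > > 		if (pid == 0)
> > > > > > 			_exit(0);
> > > > > > 		waitpid(pid, NULL, 0);
> > > > > > 		forks++;
> > > > > > 	}
> > > > > > 	printf("%ld forks in 10s with %ld extra VMAs\n", forks, n);
> > > > > > 	return 0;
> > > > > > }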
> > > > > >
> > > > > > Below are the test results. By default, there are 21 VMAs. The first row
> > > > > > shows the number of additional VMAs added on top of the default. The next
> > > > > > two rows show the number of fork() calls per ten seconds before and after
> > > > > > this series, and the final row shows the relative change. The results
> > > > > > were obtained with CPU binding to avoid scheduler load balancing that
> > > > > > could cause unstable results. There are still some fluctuations in the
> > > > > > results, but at least they are better than the original performance.
> > > > > >
> > > > > > Increment of VMAs: 0       100     200     400     800     1600    3200    6400
> > > > > > next-20230921:     112326  75469   54529   34619   20750   11355   6115    3183
> > > > > > Apply this:        116505  85971   67121   46080   29722   16665   9050    4805
> > > > > >                    +3.72%  +13.92% +23.09% +33.11% +43.24% +46.76% +48.00% +50.96%
> > > delta                    4179    10502   12592   11461   8972    5310    2935    1622
> > >
> > > Looking at this data, it is difficult to see what is going on because
> > > the number of VMAs per fork doubles per column while the count is
> > > forks per 10 seconds. So this table is really a logarithmic table with
> > > increases growing by 10%. Adding the delta row makes it seem like the
> > > numbers are not growing apart as I would expect.
> > >
> > > If we normalize this to VMAs per second by dividing the forks by 10
> > > and then multiplying by the number of VMAs, we get this:
> > >
> > > VMA Count:     21        121       221       421       821       1621      3221      6421
> > > log(VMA):      1.32      2.08      2.34      2.62      2.91      3.21      3.51      3.81
> > > next-20230921: 258349.8  928268.7  1215996.7 1464383.7 1707725.0 1842916.5 1420514.5 2044440.9
> > > this:          267961.5  1057443.3 1496798.3 1949184.0 2446120.6 2704729.5 2102315.0 3086251.5
> > > delta:         9611.7    129174.6  280801.6  484800.3  738395.6  861813.0  681800.5  1041810.6
> > >
> > > The first thing that I noticed was that we hit some dip in the numbers
> > > at 3221. I first thought that might be something else running on the
> > > host machine, but both runs are affected by around the same percent.
> > >
> > > Here, we do see the delta growing apart, but peaking in growth around
> > > 821 VMAs. Again that 3221 number is out of line.
> > >
> > > If we discard 21 and anything above 1621, we still see that both lines
> > > are asymptotic curves. I would expect the new algorithm to be more
> > > linear, representing O(n), but there is certainly a curve when graphed
> > > with a normalized X-axis. The older algorithm, O(n*log(n)), should be
> > > the opposite curve altogether, with a diminishing return, but it seems
> > > the more elements we have, the more operations we can perform in a
> > > second.
> > Thank you for your detailed analysis.
> >
> > So, are you expecting the transformed data to be close to a constant
> > value?
>
> I would expect it to increase linearly, but it's a curve. Also, it
> seems that both methods are near the identical curve, including the dip
> at 3221. I expect the new method to have a different curve, especially
> at the higher numbers where the fork() overhead is much less, but it
> seems they both curve asymptotically. That is, they seem to be the same
> complexity related to n, but with different constants.
>
> > Please note that besides constructing a new maple tree, there are many
> > other operations in fork(). As the number of VMAs increases, the number
> > of fork() calls decreases. Therefore, the overall cost spent on other
> > operations becomes smaller, while the cost spent on duplicating VMAs
> > increases. That's why this data grows with the increase of VMAs. I
> > speculate that if the number of VMAs is large enough to neglect the time
> > spent on other operations in fork(), this data will approach a constant
> > value.
>
> If it were the other parts of fork() causing the non-linear growth, then
> I would expect 800 -> 1600 to increase to +53% instead of +46%, and if
> we were hitting a fork() limit that dominates the data, then I would
> expect the delta of VMAs/second not to be so high at the upper 6421 -
> both algorithms still have room to gain performance at least until
> 6421 VMAs/fork.
>
> >
> > If we want to achieve the expected curve, I think we should simulate the
> > process of constructing the maple tree in user space to avoid the impact
> > of other operations in fork(), just like in the current bench_forking().
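> >
> > Something with the following shape, reusing only calls that already
> > appear in this series (a rough sketch, not the actual bench_forking()
> > code; the timing, the hypothetical next_fake_vma() helper, and the
> > pre-populated source tree 'old' are my assumptions):
> >
> > 	struct maple_tree new;
> > 	MA_STATE(mas, &new, 0, 0);
> > 	void *entry;
> > 	int ret;
> >
> > 	mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
> > 	/* start timer */
> > 	ret = __mt_dup(&old, &new, GFP_KERNEL);
> > 	mas_for_each(&mas, entry, ULONG_MAX)
> > 		mas_store(&mas, next_fake_vma());	/* replace entries */
> > 	/* stop timer */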
> > >
> > > Thinking about what is going on here, I cannot come up with a reason
> > > that there would be a curve to the line at all. If we took more
> > > measurements, I would think the samples would be an ever-increasing line
> > > with variability for some function of 16 - a saw-toothed increasing
> > > line. At least, until an upper limit is reached. We can see that the
> > > upper limit was still not achieved at 1621 since 6421 is higher for both
> > > runs, but a curve is evident on both methods, which suggests something
> > > else is a significant contributor.
> > >
> > > I would think each VMA requires the same amount of work, so a constant.
> > > The allocations would, again, be some function that increases linearly,
> > > with the existing method over-estimating by a huge number of nodes.
> > >
> > > I'm not trying to nitpick here, but it is important to be accurate in
> > > the statements because it may alter choices on how to proceed in
> > > improving this performance later. Others may be looking through
> > > these commit messages to see if something can be improved.
> > Thank you for pointing that out. I will try to describe it more
> > accurately in the commit log and see if I can measure the expected curve
> > in user space.
> > >
> > > I also feel like your notes on your algorithm are worth including in the
> > > commit because it could prove rather valuable if we revisit forking in
> > > the future.
> > Do you mean that I should write the analysis of the time complexity of
> > the new algorithm in the commit log?
>
> Yes, I think it's worth capturing. What do you think?
>
> > >
> > > The more I look at this, the more questions I have that I cannot answer.
> > > One thing we can see is that the new method is faster in this
> > > micro-benchmark.
> > Yes. It should be noted that in the field of computer science, if the
> > test results don't align with the expected mathematical calculations,
> > it indicates an error in the calculations. This is because accurate
> > calculations will always be reflected in the test results. 😂
> > >
> > > > > >
> > > > > > [1] https://github.com/kdlucas/byte-unixbench/tree/master
> > > > > >
> > > > > > Signed-off-by: Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx>
> > > > > > ---
> > > > > > include/linux/mm.h | 1 +
> > > > > > kernel/fork.c | 34 ++++++++++++++++++++----------
> > > > > > mm/internal.h | 3 ++-
> > > > > > mm/memory.c | 7 ++++---
> > > > > > mm/mmap.c | 52 ++++++++++++++++++++++++++++++++++++++++++++--
> > > > > > 5 files changed, 80 insertions(+), 17 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > > > index 1f1d0d6b8f20..10c59dc7ffaa 100644
> > > > > > --- a/include/linux/mm.h
> > > > > > +++ b/include/linux/mm.h
> > > > > > @@ -3242,6 +3242,7 @@ extern void unlink_file_vma(struct vm_area_struct *);
> > > > > > extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
> > > > > > unsigned long addr, unsigned long len, pgoff_t pgoff,
> > > > > > bool *need_rmap_locks);
> > > > > > +extern void undo_dup_mmap(struct mm_struct *mm, struct vm_area_struct *vma_end);
> > > > > > extern void exit_mmap(struct mm_struct *);
> > > > > > static inline int check_data_rlimit(unsigned long rlim,
> > > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > > index 7ae36c2e7290..2f3d83e89fe6 100644
> > > > > > --- a/kernel/fork.c
> > > > > > +++ b/kernel/fork.c
> > > > > > @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > > > int retval;
> > > > > > unsigned long charge = 0;
> > > > > > LIST_HEAD(uf);
> > > > > > - VMA_ITERATOR(old_vmi, oldmm, 0);
> > > > > > VMA_ITERATOR(vmi, mm, 0);
> > > > > > uprobe_start_dup_mmap();
> > > > > > @@ -678,16 +677,25 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > > > goto out;
> > > > > > khugepaged_fork(mm, oldmm);
> > > > > > - retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
> > > > > > - if (retval)
> > > > > > + /* Use __mt_dup() to efficiently build an identical maple tree. */
> > > > > > + retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
> > > > > > + if (unlikely(retval))
> > > > > > goto out;
> > > > > > mt_clear_in_rcu(vmi.mas.tree);
> > > > > > - for_each_vma(old_vmi, mpnt) {
> > > > > > + for_each_vma(vmi, mpnt) {
> > > > > > struct file *file;
> > > > > > vma_start_write(mpnt);
> > > > > > if (mpnt->vm_flags & VM_DONTCOPY) {
> > > > > > + mas_store_gfp(&vmi.mas, NULL, GFP_KERNEL);
> > > > > > +
> > > > > > + /* If failed, undo all completed duplications. */
> > > > > > + if (unlikely(mas_is_err(&vmi.mas))) {
> > > > > > + retval = xa_err(vmi.mas.node);
> > > > > > + goto loop_out;
> > > > > > + }
> > > > > > +
> > > > > > vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> > > > > > continue;
> > > > > > }
> > > > > > @@ -749,9 +757,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > > > if (is_vm_hugetlb_page(tmp))
> > > > > > hugetlb_dup_vma_private(tmp);
> > > > > > - /* Link the vma into the MT */
> > > > > > - if (vma_iter_bulk_store(&vmi, tmp))
> > > > > > - goto fail_nomem_vmi_store;
> > > > > > + /*
> > > > > > + * Link the vma into the MT. After using __mt_dup(), memory
> > > > > > + * allocation is not necessary here, so it cannot fail.
> > > > > > + */
> > > > > > + mas_store(&vmi.mas, tmp);
> > > > > > mm->map_count++;
> > > > > > if (!(tmp->vm_flags & VM_WIPEONFORK))
> > > > > > @@ -760,15 +770,19 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > > > if (tmp->vm_ops && tmp->vm_ops->open)
> > > > > > tmp->vm_ops->open(tmp);
> > > > > > - if (retval)
> > > > > > + if (retval) {
> > > > > > + mpnt = vma_next(&vmi);
> > > > > > goto loop_out;
> > > > > > + }
> > > > > > }
> > > > > > /* a new mm has just been created */
> > > > > > retval = arch_dup_mmap(oldmm, mm);
> > > > > > loop_out:
> > > > > > vma_iter_free(&vmi);
> > > > > > - if (!retval)
> > > > > > + if (likely(!retval))
> > > > > > mt_set_in_rcu(vmi.mas.tree);
> > > > > > + else
> > > > > > + undo_dup_mmap(mm, mpnt);
> > > > > > out:
> > > > > > mmap_write_unlock(mm);
> > > > > > flush_tlb_mm(oldmm);
> > > > > > @@ -778,8 +792,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > > > > > uprobe_end_dup_mmap();
> > > > > > return retval;
> > > > > > -fail_nomem_vmi_store:
> > > > > > - unlink_anon_vmas(tmp);
> > > > > > fail_nomem_anon_vma_fork:
> > > > > > mpol_put(vma_policy(tmp));
> > > > > > fail_nomem_policy:
> > > > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > > > index 7a961d12b088..288ec81770cb 100644
> > > > > > --- a/mm/internal.h
> > > > > > +++ b/mm/internal.h
> > > > > > @@ -111,7 +111,8 @@ void folio_activate(struct folio *folio);
> > > > > > void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > > > > > struct vm_area_struct *start_vma, unsigned long floor,
> > > > > > - unsigned long ceiling, bool mm_wr_locked);
> > > > > > + unsigned long ceiling, unsigned long tree_end,
> > > > > > + bool mm_wr_locked);
> > > > > > void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
> > > > > > struct zap_details;
> > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > index 983a40f8ee62..1fd66a0d5838 100644
> > > > > > --- a/mm/memory.c
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -362,7 +362,8 @@ void free_pgd_range(struct mmu_gather *tlb,
> > > > > > void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > > > > > struct vm_area_struct *vma, unsigned long floor,
> > > > > > - unsigned long ceiling, bool mm_wr_locked)
> > > > > > + unsigned long ceiling, unsigned long tree_end,
> > > > > > + bool mm_wr_locked)
> > > > > > {
> > > > > > do {
> > > > > > unsigned long addr = vma->vm_start;
> > > > > > @@ -372,7 +373,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > > > > > * Note: USER_PGTABLES_CEILING may be passed as ceiling and may
> > > > > > * be 0. This will underflow and is okay.
> > > > > > */
> > > > > > - next = mas_find(mas, ceiling - 1);
> > > > > > + next = mas_find(mas, tree_end - 1);
> > > > > > /*
> > > > > > * Hide vma from rmap and truncate_pagecache before freeing
> > > > > > @@ -393,7 +394,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> > > > > > while (next && next->vm_start <= vma->vm_end + PMD_SIZE
> > > > > > && !is_vm_hugetlb_page(next)) {
> > > > > > vma = next;
> > > > > > - next = mas_find(mas, ceiling - 1);
> > > > > > + next = mas_find(mas, tree_end - 1);
> > > > > > if (mm_wr_locked)
> > > > > > vma_start_write(vma);
> > > > > > unlink_anon_vmas(vma);
> > > > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > > > index 2ad950f773e4..daed3b423124 100644
> > > > > > --- a/mm/mmap.c
> > > > > > +++ b/mm/mmap.c
> > > > > > @@ -2312,7 +2312,7 @@ static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
> > > > > > mas_set(mas, mt_start);
> > > > > > free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> > > > > > next ? next->vm_start : USER_PGTABLES_CEILING,
> > > > > > - mm_wr_locked);
> > > > > > + tree_end, mm_wr_locked);
> > > > > > tlb_finish_mmu(&tlb);
> > > > > > }
> > > > > > @@ -3178,6 +3178,54 @@ int vm_brk(unsigned long addr, unsigned long len)
> > > > > > }
> > > > > > EXPORT_SYMBOL(vm_brk);
> > > > > > +void undo_dup_mmap(struct mm_struct *mm, struct vm_area_struct *vma_end)
> > > > > > +{
> > > > > > + unsigned long tree_end;
> > > > > > + VMA_ITERATOR(vmi, mm, 0);
> > > > > > + struct vm_area_struct *vma;
> > > > > > + unsigned long nr_accounted = 0;
> > > > > > + int count = 0;
> > > > > > +
> > > > > > + /*
> > > > > > + * vma_end points to the first VMA that has not been duplicated. We need
> > > > > > + * to unmap all VMAs before it.
> > > > > > + * If vma_end is NULL, it means that all VMAs in the maple tree have
> > > > > > + * been duplicated, so setting tree_end to 0 will overflow to ULONG_MAX
> > > > > > + * when using it.
> > > > > > + */
> > > > > > + if (vma_end) {
> > > > > > + tree_end = vma_end->vm_start;
> > > > > > + if (tree_end == 0)
> > > > > > + goto destroy;
> > > > > > + } else
> > > > > > + tree_end = 0;
>
> You need braces around this else branch to meet the coding style. You
> could just set tree_end = 0 at the start of the function instead -
> actually, I think tree_end = USER_PGTABLES_CEILING unless there is a
> vma_end.
>
> > > > > > +
> > > > > > + vma = mas_find(&vmi.mas, tree_end - 1);
>
> vma = vma_find(&vmi, tree_end);
>
> > > > > > +
> > > > > > + if (vma) {
>
> Probably would be cleaner to jump to destroy here too:
> 	if (!vma)
> 		goto destroy;
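>
> Folding these suggestions together, the top of the function could look
> something like this (an untested sketch of my suggestions, not the
> final patch):
>
> void undo_dup_mmap(struct mm_struct *mm, struct vm_area_struct *vma_end)
> {
> 	unsigned long tree_end = USER_PGTABLES_CEILING;
> 	VMA_ITERATOR(vmi, mm, 0);
> 	struct vm_area_struct *vma;
>
> 	if (vma_end) {
> 		tree_end = vma_end->vm_start;
> 		if (tree_end == 0)
> 			goto destroy;
> 	}
>
> 	vma = vma_find(&vmi, tree_end);
> 	if (!vma)
> 		goto destroy;
>
> 	/* ... unmap and remove the duplicated VMAs as in the patch ... */
>
> destroy:
> 	__mt_destroy(&mm->mm_mt);
> }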
>
> > > > > > + arch_unmap(mm, vma->vm_start, tree_end);

One more thing, it seems the maple state that is passed into
unmap_region() needs to point to the _next_ element, or the reset
doesn't work right between the unmap_vmas() and free_pgtables() call:

vma_iter_set(&vmi, vma->vm_end);


> > > > > > + unmap_region(mm, &vmi.mas, vma, NULL, NULL, 0, tree_end,
> > > > > > + tree_end, true);
> > > > >
> > > > > next is vma_end, as per your comment above. Using next = vma_end allows
> > > > > you to avoid adding another argument to free_pgtables().
> > > > Unfortunately, it cannot be done this way. I fell into this trap before,
> > > > and it caused incomplete page table cleanup. To solve this problem, the
> > > > only solution I can think of right now is to add an additional
> > > > parameter.
> > > >
> > > > free_pgtables() will be called in unmap_region() to free the page table,
> > > > like this:
> > > >
> > > > free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> > > > next ? next->vm_start : USER_PGTABLES_CEILING,
> > > > mm_wr_locked);
> > > >
> > > > The problem is with 'next'. Our 'vma_end' does not exist in the actual
> > > > mmap because it has not been duplicated and cannot be used as 'next'.
> > > > If there is a real 'next', we can use 'next->vm_start' as the ceiling,
> > > > which is not a problem. If there is no 'next' (next is 'vma_end'), we
> > > > can only use 'USER_PGTABLES_CEILING' as the ceiling. Using
> > > > 'vma_end->vm_start' as the ceiling will cause the page table not to be
> > > > fully freed, which may be related to alignment in 'free_pgd_range()'. To
> > > > solve this problem, we have to introduce 'tree_end', and separating
> > > > 'tree_end' and 'ceiling' can solve this problem.
> > >
> > > Can you just use ceiling? That is, just not pass in next and keep the
> > > code as-is? This is how exit_mmap() does it and should avoid any
> > > alignment issues. I assume you tried that and something went wrong as
> > > well?
> > I tried that, but it didn't work either. In free_pgtables(), the
> > following line of code is used to iterate over VMAs:
> > mas_find(mas, ceiling - 1);
> > If next is passed as NULL, ceiling will be 0, resulting in iterating
> > over all the VMAs in the maple tree, including the last portion that was
> > not duplicated.
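> >
> > (With next == NULL, the ceiling there is 0, so the call becomes
> > mas_find(mas, 0 - 1), i.e. mas_find(mas, ULONG_MAX), which walks
> > every entry in the tree.)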
>
> If vma_end is NULL, it means that all VMAs in the maple tree have been
> duplicated, so shouldn't the correct action in this case be freeing up
> to ceiling?
>
> If it isn't null, then vma_end->vm_start should work as the end of the
> area to free.
>
> With your mas_find(mas, tree_end - 1), then the vma_end will be avoided,
> but free_pgd_range() will use ceiling anyways:
>
> free_pgd_range(tlb, addr, vma->vm_end, floor, next ? next->vm_start : ceiling);
>
> Passing in vma_end as next to unmap_region() works in my testing
> without adding arguments to free_pgtables().
>
> How are you producing the incomplete page table cleanup you mention
> above? Maybe I missed something?
>
>
> > >
> > > >
> > > > >
> > > > > > +
> > > > > > + mas_set(&vmi.mas, vma->vm_end);
> vma_iter_set(&vmi, vma->vm_end);
> > > > > > + do {
> > > > > > + if (vma->vm_flags & VM_ACCOUNT)
> > > > > > + nr_accounted += vma_pages(vma);
> > > > > > + remove_vma(vma, true);
> > > > > > + count++;
> > > > > > + cond_resched();
> > > > > > + vma = mas_find(&vmi.mas, tree_end - 1);
> > > > > > + } while (vma != NULL);
>
> You can write this as:
> do { ... } for_each_vma_range(vmi, vma, tree_end);
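>
> (for_each_vma_range() expands to a while () loop around
> vma_find(&vmi, tree_end), so appended after the do { } block it forms
> the do/while and keeps iterating until vma_find() runs out of VMAs
> below tree_end.)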
>
> > > > > > +
> > > > > > + BUG_ON(count != mm->map_count);
> > > > > > +
> > > > > > + vm_unacct_memory(nr_accounted);
> > > > > > + }
> > > > > > +
> > > > > > +destroy:
> > > > > > + __mt_destroy(&mm->mm_mt);
> > > > > > +}
> > > > > > +
> > > > > > /* Release all mmaps. */
> > > > > > void exit_mmap(struct mm_struct *mm)
> > > > > > {
> > > > > > @@ -3217,7 +3265,7 @@ void exit_mmap(struct mm_struct *mm)
> > > > > > mt_clear_in_rcu(&mm->mm_mt);
> > > > > > mas_set(&mas, vma->vm_end);
> > > > > > free_pgtables(&tlb, &mas, vma, FIRST_USER_ADDRESS,
> > > > > > - USER_PGTABLES_CEILING, true);
> > > > > > + USER_PGTABLES_CEILING, USER_PGTABLES_CEILING, true);
> > > > > > tlb_finish_mmu(&tlb);
> > > > > > /*
> > > > > > --
> > > > > > 2.20.1
> > > > > >
> > > > >
> > > >
> > >