Re: Maple Tree Work

From: Peng Zhang
Date: Fri Jul 14 2023 - 08:33:41 EST




在 2023/7/13 23:36, Liam R. Howlett 写道:
* Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [230713 04:05]:


在 2023/7/12 23:27, Liam R. Howlett 写道:
Dropping Danilo from the Cc..

I was asked to add linux-mm to the list, so I did that as well.

If anyone else is interested in seeing the full list, it's on lkml [1] and
the maple-tree list [2].

Thank you Peng for looking at the list and taking time to think about
the items.

* Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [230712 07:49]:


在 2023/7/8 00:38, Liam R. Howlett 写道:
- Fork & Dup tree + Delete DONT_COPY
This is to optimize dup_mmap() in kernel/fork.c, but other
users may want faster duplications of a tree.
This should be faster than building a tree in bulk mode. The
idea is to just copy each node and replace the pointers in,
probably, a BFS order. Once the leaves are reached, the VMAs
will be replaced by the copies made in fork, unless DONT_COPY is
set, in which case the VMA will be deleted from the copied tree.
DONT_COPY is not common and since the tree isn't visible, all
nodes should be available for reuse (no RCU worries).
If DONT_COPY is set, this method will be complicated, because the gaps
adjacent to it need to be merged, and the gaps of all ancestor nodes need to
be updated.

My understanding is that this is a rare event; there aren't many VMAs
marked this way. The store operation already does all the necessary
work for deleting an entry. The gap tracking is also updated, and that
would only happen if the new gap is larger. Are you concerned about the
performance/runtime of handling the DONT_COPY in this way?
Yes, if no DONT_COPY is set, copying all nodes and replacing the
pointers must be the fastest way. I was just thinking if there is
a faster way if DONT_COPY exists. Using the store operation to
delete unnecessary entries will cause additional overhead sometimes.
To give an example:

Suppose a leaf node layout of the maple tree is as follows:
[VMA1][VMA2][NULL][VMA3]

If VMA2 has DONT_COPY set, we need to change the node layout as
follows to delete it:
[VMA1'][NULL][VMA3']

At least we need to copy this node to replace it to make this
delete operation, even need to rebalance. However, this is a
rare case. In most cases, there is no gap between VMAs, so it
will not cause changes in the node layout.

Remember that at this point, there is no readers so we could edit the
node without a copy. It would require new code, but it's just moving
data left.. The bigger worry is a rebalance, as you said, and that can
get complicated. We know we have (more than) enough room to store the
data, but editing in place isn't done in this version of the code. We
do allow for node re-use by pushing back onto the mas->alloc, so the
data requirements won't be too high. Without any readers, the parent
pivots could be changed directly between two leaves.



I have another idea to build a tree, if inserted in order, we only
insert at the leaf node. All leaf nodes are connected using a linked
list. In the end we get a linked list with only leaf nodes. Then we
construct non-leaf nodes layer by layer from bottom to top. I think
this is also faster than bulk mode. Another advantage of this method
is that we are applicable to more scenarios, do not need the original
tree, only need to know the ranges inserted in order. I don't know
how fast this method is, so we can discuss it.

What is the advantage of a linked list over just building the tree as
you go? Considering the non-leaf nodes are just a list of nodes
already, and you will have to do the same work of setting pivots,
allocating nodes, and filling them after you have the linked list.

What work do you avoid that would make a linked list faster than bulk
insert or a tree copy/replace VMAs? I was thinking that we could avoid
a lot of the work involved with splitting/pushing and the parent
construction by using memcpy of each node, replace each slot (and
parent) with a new memcpy of the mirrored tree, then have a minimum
amount of modifications to delete the DONT_COPY during the VMA
replacement. BFS copy would ensure we wouldn't modify the source tree
during VMA replacement and deletion (DONT_COPY). So the rebalancing (in
bulk insert), pivot calculations, metadata, and gaps are (mostly) saved
by using memcpy.
Your analysis is correct.

From what I understand from your linked list idea, we would need to
construct the child node by examining each entry and know that a certain
entry is a DONT_COPY (ie: VMA logic would be needed in the maple tree
code or a callback?). We really can't have VMA logic in the maple tree
code, so we could do some sort of loop through the VMAs to add the entry
to the list if desired.

Once we have a linked list, we still need to figure out each pivot for
the parent (I guess we won't use the last slot so there is a pivot to
check?), and each max gap in each child to construct the upper layers
of the tree. Is this correct?

I guess we would still need to adjust the last node to ensure sufficient
entries as well, so as we add items we may need to rebalance the last
leaf with the second-last leaf.
Yes, the last two leaves need to check to make sure they have enough
items.

The bulk store currently adjusts the split to favour splitting
left, could you get the same result by strictly filling the nodes? This
would have to have special handling to rebalance the last one - which we
have a pretty good idea of when it's going to happen as we have a count
(and the DONT_COPY is rare).

Are you thinking you could compact the tree at the same time to gain
efficiency?

What would you consider a sufficient packed tree? It's best to keep
some space around for expansion/contraction. This works out since, I
think, we would need to keep that last slot free so we have a pivot to
check in your linked list plan. Initial development with strict
split/join rules resulted in a 'jittering' of nodes as the number of
entries in a node shifted just above/below the threshold so we relaxed
the merging of nodes to avoid such a scenario.
We can control the number of entries of nodes, for example, let this
number be (min + max)/2, so as to avoid making a node too 'dense' or
too 'loose'.

By the way, in the VMA case, we also know the number of VMAs in the
tree. Unfortunately, we don't know how many are DONT_COPY VMAs. I
wonder if it would be worth while to balance each VMA with its neighbour
during this operation, at least within one tree level? It would reduce
the possibility of a larger rebalance on a DONT_COPY. It's probably not
worth it because it would slow down our fast path.


Let me know if you would like me to put your name beside the Fork & Dup
Tree item in the list of work.
You can put my name on this one and I'll do it.

Sounds good, thanks!

I use the method of copying all nodes, so I will implement an interface
to get a mirrored tree.

But I can't think of a good way to replace old VMAs, it can't be done
during the copy tree process, because maybe some VMAs have DONT_COPY
flag.

I think it could be done during the copy, instead of a 'replace' it
would be a 'delete'. I think this is why we need to use a BFS-like
duplication. So once we reach the leaves, then we can modify the tree
knowing that the above state has already been copied and so it's going
to be altering a copy of the data and so we are at a point where it can
be mutated. You could detect that the lock-step next/insert is out of
sync and delete the range between the old_mas and the new_mas.

It seems that we can only scan all VMAs in the source tree again
to update the new tree. We have already traversed the source tree once
in the process of copying the tree. Is there any way to avoid traversing
it again?

Well, we haven't visited the VMAs in the copy, but we have visited the
leaf nodes with all the pointers to the VMAs. I get what you are
saying thought, we will have to duplicate the leaves then re-visit the
leaves to replace the VMAs. I am not sure we can avoid this since a
rebalance may occur, and it would be very tricky to rebalance with old
and new data in the tree - it's probably best to just revisit, at least
to begin with.

Depending on how you implement it, you could make the copying of the
tree end on the first leaf node by using the height of the tree to
figure out which way to sweep (left to right or right to left) on the
first level. Not strictly BFS, but you could end the maple state in the
correct location to start replacing the VMAs. Would that work?

We also have a reverse iterator, so we could just run through the tree
from the right-most node to the start.

I was thinking that we would make the first 'duplication store' detect
an empty tree and duplicate the tree, ending on the left-most leaf and
then replace the first entry (and possibly delete up to the first
store). Each subsequent store would do the same. We would need a
'duplication complete' that would remove any entries beyond the last
store and rebalance, if necessary.

Feel free to use some or none of the ideas. I hope some of this helps
with what you are trying to work out. Let me know if you have any more
questions.
Thank you for providing so much information, I am doing this feature.

Thanks,
Peng

Thanks,
Liam