Re: [PATCH v3 3/9] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()

From: Peng Zhang
Date: Thu Oct 05 2023 - 12:11:31 EST




在 2023/10/4 22:25, Liam R. Howlett 写道:
* Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [231004 05:09]:


在 2023/10/4 02:45, Liam R. Howlett 写道:
* Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx> [230924 23:58]:
Introduce interfaces __mt_dup() and mtree_dup(), which are used to
duplicate a maple tree. They duplicate a maple tree in Depth-First
Search (DFS) pre-order traversal. It uses memcopy() to copy nodes in the
source tree and allocate new child nodes in non-leaf nodes. The new node
is exactly the same as the source node except for all the addresses
stored in it. It will be faster than traversing all elements in the
source tree and inserting them one by one into the new tree. The time
complexity of these two functions is O(n).

The difference between __mt_dup() and mtree_dup() is that mtree_dup()
handles locks internally.

Signed-off-by: Peng Zhang <zhangpeng.00@xxxxxxxxxxxxx>
---
include/linux/maple_tree.h | 3 +
lib/maple_tree.c | 286 +++++++++++++++++++++++++++++++++++++
2 files changed, 289 insertions(+)

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index 666a3764ed89..de5a4056503a 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -329,6 +329,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
void *entry, gfp_t gfp);
void *mtree_erase(struct maple_tree *mt, unsigned long index);
+int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
+int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
+
void mtree_destroy(struct maple_tree *mt);
void __mt_destroy(struct maple_tree *mt);
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 3fe5652a8c6c..ed8847b4f1ff 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -6370,6 +6370,292 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
}
EXPORT_SYMBOL(mtree_erase);
+/*
+ * mas_dup_free() - Free an incomplete duplication of a tree.
+ * @mas: The maple state of a incomplete tree.
+ *
+ * The parameter @mas->node passed in indicates that the allocation failed on
+ * this node. This function frees all nodes starting from @mas->node in the
+ * reverse order of mas_dup_build(). There is no need to hold the source tree
+ * lock at this time.
+ */
+static void mas_dup_free(struct ma_state *mas)
+{
+ struct maple_node *node;
+ enum maple_type type;
+ void __rcu **slots;
+ unsigned char count, i;
+
+ /* Maybe the first node allocation failed. */
+ if (mas_is_none(mas))
+ return;
+
+ while (!mte_is_root(mas->node)) {
+ mas_ascend(mas);
+
+ if (mas->offset) {
+ mas->offset--;
+ do {
+ mas_descend(mas);
+ mas->offset = mas_data_end(mas);
+ } while (!mte_is_leaf(mas->node));
+
+ mas_ascend(mas);
+ }
+
+ node = mte_to_node(mas->node);
+ type = mte_node_type(mas->node);
+ slots = ma_slots(node, type);
+ count = mas_data_end(mas) + 1;
+ for (i = 0; i < count; i++)
+ ((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
+
+ mt_free_bulk(count, slots);
+ }
+
+ node = mte_to_node(mas->node);
+ mt_free_one(node);
+}
+
+/*
+ * mas_copy_node() - Copy a maple node and replace the parent.
+ * @mas: The maple state of source tree.
+ * @new_mas: The maple state of new tree.
+ * @parent: The parent of the new node.
+ *
+ * Copy @mas->node to @new_mas->node, set @parent to be the parent of
+ * @new_mas->node. If memory allocation fails, @mas is set to -ENOMEM.
+ */
+static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
+ struct maple_pnode *parent)
+{
+ struct maple_node *node = mte_to_node(mas->node);
+ struct maple_node *new_node = mte_to_node(new_mas->node);
+ unsigned long val;
+
+ /* Copy the node completely. */
+ memcpy(new_node, node, sizeof(struct maple_node));
+
+ /* Update the parent node pointer. */
+ val = (unsigned long)node->parent & MAPLE_NODE_MASK;
+ new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
+}
+
+/*
+ * mas_dup_alloc() - Allocate child nodes for a maple node.
+ * @mas: The maple state of source tree.
+ * @new_mas: The maple state of new tree.
+ * @gfp: The GFP_FLAGS to use for allocations.
+ *
+ * This function allocates child nodes for @new_mas->node during the duplication
+ * process. If memory allocation fails, @mas is set to -ENOMEM.
+ */
+static inline void mas_dup_alloc(struct ma_state *mas, struct ma_state *new_mas,
+ gfp_t gfp)
+{
+ struct maple_node *node = mte_to_node(mas->node);
+ struct maple_node *new_node = mte_to_node(new_mas->node);
+ enum maple_type type;
+ unsigned char request, count, i;
+ void __rcu **slots;
+ void __rcu **new_slots;
+ unsigned long val;
+
+ /* Allocate memory for child nodes. */
+ type = mte_node_type(mas->node);
+ new_slots = ma_slots(new_node, type);
+ request = mas_data_end(mas) + 1;
+ count = mt_alloc_bulk(gfp, request, (void **)new_slots);
+ if (unlikely(count < request)) {
+ if (count) {
+ mt_free_bulk(count, new_slots);

If you look at mm/slab.c: kmem_cache_alloc(), you will see that the
error path already bulk frees for you - but does not zero the array.
This bulk free will lead to double free, but you do need the below
memset(). Also, it will return !count or request. So, I think this code
is never executed as it is written.
If kmem_cache_alloc() is called to allocate memory in mt_alloc_bulk(),
then this code will not be executed because it only returns 0 or
request. However, I am concerned that changes to mt_alloc_bulk() like
[1] may be merged, which could potentially lead to memory leaks. To
improve robustness, I wrote it this way.

How do you think it should be handled? Is it okay to do this like the
code below?

if (unlikely(count < request)) {
memset(new_slots, 0, request * sizeof(unsigned long));
mas_set_err(mas, -ENOMEM);
return;
}

[1] https://lore.kernel.org/lkml/20230810163627.6206-13-vbabka@xxxxxxx/

Ah, I see.

We should keep the same functionality as before. The code you are
referencing is an RFC and won't be merged as-is. We should be sure to
keep an eye on this happening.

I think the code you have there is correct.


I don't think this will show up in your testcases because the test code
doesn't leave dangling pointers and simply returns 0 if there isn't
enough nodes.
Yes, no testing here.

Yeah :/ I think we should update the test code at some point to behave
the same as the real code. Don't worry about it here though.
If we want to test this here, we need to modify the
kmem_cache_alloc_bulk() in the user space to allocate a portion of
memory. This will cause it to behave differently from the corresponding
function in the kernel space. I'm not sure if this modification is
acceptable.

Also, I might need to move the memset() outside of the if
statement (if (unlikely(count < request)){}) to use it for cleaning up
residual pointers.



+ memset(new_slots, 0, count * sizeof(unsigned long));
+ }
+ mas_set_err(mas, -ENOMEM);
+ return;
+ }
+
+ /* Restore node type information in slots. */
+ slots = ma_slots(node, type);
+ for (i = 0; i < count; i++) {
+ val = (unsigned long)mt_slot_locked(mas->tree, slots, i);
+ val &= MAPLE_NODE_MASK;
+ ((unsigned long *)new_slots)[i] |= val;
+ }
+}
+
+/*
+ * mas_dup_build() - Build a new maple tree from a source tree
+ * @mas: The maple state of source tree.
+ * @new_mas: The maple state of new tree.
+ * @gfp: The GFP_FLAGS to use for allocations.
+ *
+ * This function builds a new tree in DFS preorder. If the memory allocation
+ * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
+ * last node. mas_dup_free() will free the incomplete duplication of a tree.
+ *
+ * Note that the attributes of the two trees need to be exactly the same, and the
+ * new tree needs to be empty, otherwise -EINVAL will be set in @mas.
+ */
+static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
+ gfp_t gfp)
+{
+ struct maple_node *node;
+ struct maple_pnode *parent = NULL;
+ struct maple_enode *root;
+ enum maple_type type;
+
+ if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
+ unlikely(!mtree_empty(new_mas->tree))) {

Would it be worth checking mas_is_start() for both mas and new_mas here?
Otherwise mas_start() will not do what you want below. I think it is
implied that both are at MAS_START but never checked?
This function is an internal function and is currently only called by
{mtree,__mt}_dup(). It is ensured that both 'mas' and 'new_mas' are
MAS_START when called. Do you think we really need to check it? Maybe we
just need to explain it in the comments?

Yes, just document that it is expected to be MAS_START.


+ mas_set_err(mas, -EINVAL);
+ return;
+ }
+
+ mas_start(mas);
+ if (mas_is_ptr(mas) || mas_is_none(mas)) {
+ root = mt_root_locked(mas->tree);
+ goto set_new_tree;
+ }
+
+ node = mt_alloc_one(gfp);
+ if (!node) {
+ new_mas->node = MAS_NONE;
+ mas_set_err(mas, -ENOMEM);
+ return;
+ }
+
+ type = mte_node_type(mas->node);
+ root = mt_mk_node(node, type);
+ new_mas->node = root;
+ new_mas->min = 0;
+ new_mas->max = ULONG_MAX;
+ root = mte_mk_root(root);
+
+ while (1) {
+ mas_copy_node(mas, new_mas, parent);
+
+ if (!mte_is_leaf(mas->node)) {
+ /* Only allocate child nodes for non-leaf nodes. */
+ mas_dup_alloc(mas, new_mas, gfp);
+ if (unlikely(mas_is_err(mas)))
+ return;
+ } else {
+ /*
+ * This is the last leaf node and duplication is
+ * completed.
+ */
+ if (mas->max == ULONG_MAX)
+ goto done;
+
+ /* This is not the last leaf node and needs to go up. */
+ do {
+ mas_ascend(mas);
+ mas_ascend(new_mas);
+ } while (mas->offset == mas_data_end(mas));
+
+ /* Move to the next subtree. */
+ mas->offset++;
+ new_mas->offset++;
+ }
+
+ mas_descend(mas);
+ parent = ma_parent_ptr(mte_to_node(new_mas->node));
+ mas_descend(new_mas);
+ mas->offset = 0;
+ new_mas->offset = 0;
+ }
+done:
+ /* Specially handle the parent of the root node. */
+ mte_to_node(root)->parent = ma_parent_ptr(mas_tree_parent(new_mas));
+set_new_tree:
+ /* Make them the same height */
+ new_mas->tree->ma_flags = mas->tree->ma_flags;
+ rcu_assign_pointer(new_mas->tree->ma_root, root);
+}
+
+/**
+ * __mt_dup(): Duplicate a maple tree
+ * @mt: The source maple tree
+ * @new: The new maple tree
+ * @gfp: The GFP_FLAGS to use for allocations
+ *
+ * This function duplicates a maple tree in Depth-First Search (DFS) pre-order
+ * traversal. It uses memcopy() to copy nodes in the source tree and allocate
+ * new child nodes in non-leaf nodes. The new node is exactly the same as the
+ * source node except for all the addresses stored in it. It will be faster than
+ * traversing all elements in the source tree and inserting them one by one into
+ * the new tree.
+ * The user needs to ensure that the attributes of the source tree and the new
+ * tree are the same, and the new tree needs to be an empty tree, otherwise
+ * -EINVAL will be returned.
+ * Note that the user needs to manually lock the source tree and the new tree.
+ *
+ * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
+ * the attributes of the two trees are different or the new tree is not an empty
+ * tree.
+ */
+int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
+{
+ int ret = 0;
+ MA_STATE(mas, mt, 0, 0);
+ MA_STATE(new_mas, new, 0, 0);
+
+ mas_dup_build(&mas, &new_mas, gfp);
+
+ if (unlikely(mas_is_err(&mas))) {
+ ret = xa_err(mas.node);
+ if (ret == -ENOMEM)
+ mas_dup_free(&new_mas);
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(__mt_dup);
+
+/**
+ * mtree_dup(): Duplicate a maple tree
+ * @mt: The source maple tree
+ * @new: The new maple tree
+ * @gfp: The GFP_FLAGS to use for allocations
+ *
+ * This function duplicates a maple tree in Depth-First Search (DFS) pre-order
+ * traversal. It uses memcopy() to copy nodes in the source tree and allocate
+ * new child nodes in non-leaf nodes. The new node is exactly the same as the
+ * source node except for all the addresses stored in it. It will be faster than
+ * traversing all elements in the source tree and inserting them one by one into
+ * the new tree.
+ * The user needs to ensure that the attributes of the source tree and the new
+ * tree are the same, and the new tree needs to be an empty tree, otherwise
+ * -EINVAL will be returned.

The requirement to duplicate the entire tree should be mentioned and
maybe the mas_is_start() requirement (as I asked about above?)
Okay, I will add a comment saying 'This duplicates the entire tree'. But
'mas_is_start()' is not a requirement for calling this function because
the function's parameter is 'maple_tree', not 'ma_state'. I think
'mas_is_start()' should be added to the comment for 'mas_dup_build()'.

Oh right, thanks.


I can see someone thinking they are going to make a super fast sub-tree
of existing data using this - which won't (always?) work.

+ *
+ * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
+ * the attributes of the two trees are different or the new tree is not an empty
+ * tree.
+ */
+int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
+{
+ int ret = 0;
+ MA_STATE(mas, mt, 0, 0);
+ MA_STATE(new_mas, new, 0, 0);
+
+ mas_lock(&new_mas);
+ mas_lock_nested(&mas, SINGLE_DEPTH_NESTING);
+
+ mas_dup_build(&mas, &new_mas, gfp);
+ mas_unlock(&mas);
+
+ if (unlikely(mas_is_err(&mas))) {
+ ret = xa_err(mas.node);
+ if (ret == -ENOMEM)
+ mas_dup_free(&new_mas);
+ }
+
+ mas_unlock(&new_mas);
+
+ return ret;
+}
+EXPORT_SYMBOL(mtree_dup);
+
/**
* __mt_destroy() - Walk and free all nodes of a locked maple tree.
* @mt: The maple tree
--
2.20.1