Maple Tree Work 2023/09

From: Liam R. Howlett
Date: Tue Sep 12 2023 - 13:29:06 EST


Hello,

This months update of the maple tree development and work items follows.
I send out an updated list monthly with peoples names against work items
to avoid work duplication, so please don't work on something with a name
next to the work item.

If you want to work on something then please respond to this email so
everyone knows, and I can add your name to the item. Feel free to ask
questions to clarify the feature or discuss directions.

Likewise, if you cannot work on anything anymore then let me know so
others can.

If you have any ideas, then please email the list. We can discuss it
and I can add it to the list.

Please sign up to the maple tree mailing list [1] to get all updates.

Features:
- Better preallocation calculations - Liam R. Howlett 2023/07/07
Without tracking the tree status on the walk down, we can
partially walk down to figure out the type of write and which
'worst case' a write will cause. Using this "write type"
information, the preallocations can be a better estimate. v3 was
sent to the mailing list [2] and is in v6.6-rc1.

- mas_store_gfp() align with mas_prealloc()
Right now mas_store_gfp() is going to allocate once it figures
out what to do and it does the worst case allocations. This is
inefficient and can easily be done better by doing more of what
mas_prealloc()/mas_store_prealloc() does - or at least will be
doing once the 'Better preallocation calculations' lands.

- Tracking tree status on walks down
Track in the maple state when the last minimum nodes, space for
2, or space for 3 slots occurred (height in the tree). This
will allow for a better estimate on how many nodes are needed
for a write. This can be complicated on next/prev walking, etc.
It has to be done in read mode since some walk may have been
done before deciding to write - see mmap_region(), for example.

- Store type (enum for store type?)
Extending the "Tracking tree status on walks down", the
information can then be used to determine what operation is
needed during a mas_prealloct(), etc. The operation can be
stored in the maple state to continue on a mas_store_prealloc().
Obviously, this would work out well if we have mas_store_gfp()
using the same functionality as mentioned above.

- Full count/flag & Dense nodes
There is a bit that exists in the pointer reserved to indicate
there is no NULLs in the leaves below. This feature is mainly
for trees that are non-alloc trees. We can then know there is
at least one singleton that can be stored below this entry.
This is coupled with restoring Dense nodes as a potential node
type. The tricky part is deciding on when to switch to dense
nodes/back from dense nodes (all entries to dense nodes must be
singletons!). See mte_set_full(), mte_clear_full(),
mte_has_null() which use MAPLE_ENODE_NULL.

- Fork & Dup tree + Delete DONT_COPY - Peng Zhang 2023/07/12
This is to optimize dup_mmap() in kernel/fork.c, but other
users may want faster duplications of a tree.
This should be faster than building a tree in bulk mode. The
idea is to just copy each node and replace the pointers in,
probably, a BFS order. Once the leaves are reached, the VMAs
will be replaced by the copies made in fork, unless DONT_COPY is
set, in which case the VMA will be deleted from the copied tree.
DONT_COPY is not common and since the tree isn't visible, all
nodes should be available for reuse (no RCU worries). v2 was
sent to the mailing list [3].

- Push reuse parent
During an insert, new nodes can be "pushed" left or right -
see mas_push_data(). On a left push, it may be possible to
reuse the node by making the write like an 'append'. This may
also be possible to be done during a split operation when the
write extends the last slot. This needs examining to ensure RCU
safety.

- Contiguous iterator
Some users want to iterate across a range of items, but only
contiguous items. Sometimes a gap within the range is
considered an error while other times they are not. Good
targets for this feature are mlock() and mprotect() system
calls. The goal is to provide an easy to use interface that
reduces the complexity of the external code that keeps track of
the contiguous nature of the iteration.

- Remove BIG_NODE
maple tree nodes are rebalanced, split, and updated using a
maple_big_node struct. This structure is over twice the size of
regular nodes to ensure there is enough space to contain the
data. Ideally the same feats can be accomplished by using
regular sized nodes.

- Test & Fix destroy_rebalance
Destroy_rebalance needs better testing when it occurs in rcu
mode. Currently it is not used in this mode, but it should be
supported. This part of the code needs better test coverage and
fix any necessary issues that arise with rcu readers during
updates.

- More userspace testing - Liam R. Howlett 2023/09/12
Add userspace fuzzer testing to the test suite. There was a
patch sent out to create a basic clang fuzzer. Using something
along those lines, write a fuzzer to test the maple tree.

- Live locking with delays for testing
Add some sort of functional delay operation to work with testing
to reduce potential locking issues, especially with CPU
contention.

- mas_for_each caching of end of node
Adding metadata for the end of node was a big performance win.
Extend this by caching the end of the node while iterating (to
start). Later this could be expanded to other iterators.

- Add min split span to parents
When nodes are split, the parent node splits by number of
entries. It may be better to split by range as it would help
avoid splitting more frequently on inserts.

- Search Marks
To complete the features that xarray support, the maple tree
needs to support search marks. Marks will replace the gap
tracking in a new node type complete with searching for a
specific tag from the top-level to the leaves; just like the
gaps that exist in the allocation nodes. It will probably need
two new node types (range and dense). Dense with 29 pointers,
range64 with 15 pointers.

Larger Features:
- 64 bit stores on 32 bit arch
A number of users want to use the maple tree, but cannot because
they need a storage medium that supports 64 bit stores on 32 bit
arch.

- wr_mas with alloc instead of mas
Internally, we have a write maple state, but the maple state
currently holds the allocations. I'm not sure if this is worth
it, and it will probably need "Tracking tree status on walks
down" so that preallocations/allocations can be accurate. Other
considerations are how mas_prealloc() would work with this
change.

- Big Dense Node Type
There are various node types that could be added to the maple
tree. One of the most useful is probably a 'big dense' with the
node being a 4k block of singletons. This would have to come
after the dense node type as dense enables more users and helps
scope this work.

- Judy Array
The Judy Array is another fancy data structure. Mathieu
Desnoyers has enquired and spoken to us about incorporating judy
arrays as a node type or other features he has in the judy
array. This is in the early thought stages.

Please note that any new features will need test cases added to ensure
they function correctly.

[1] https://lists.infradead.org/mailman/listinfo/maple-tree
[2] http://lists.infradead.org/pipermail/maple-tree/2023-July/002736.html
[3] https://lore.kernel.org/linux-mm/20230830125654.21257-1-zhangpeng.00@xxxxxxxxxxxxx/