[PATCH 0/2] Introducing the eXtensible Array (xarray)

From: Matthew Wilcox
Date: Tue Feb 28 2017 - 13:14:41 EST


From: Matthew Wilcox <mawilcox@xxxxxxxxxxxxx>

I wrote the xarray to replace the radix tree with a better API based
on observing how programmers are currently using the radix tree, and
on how (and why) they aren't. Conceptually, an xarray is an array of
2^BITS_PER_LONG pointers which is initially full of NULL pointers.

This is the result of about two weeks worth of hacking. There is much
more to do; I think the most important thing remaining is to start
working on converting the page cache from the radix tree to the xarray.
That will tell me where the API could be better. I'm sending it out
now for feedback; I know there are bugs and some half-finished code,
but I'm interested in comments on the design direction.

Improvements the xarray has over the radix tree:

- The radix tree provides operations like other trees do; 'insert' and
'delete'. But what users really want is an automatically resizing
array, and so it makes more sense to give users an API that is like
an array -- 'load' and 'store'.
- Locking is part of the API. This simplifies a lot of users who
formerly had to manage their own locking just for the radix tree.
It also improves code generation as we can now tell RCU that we're
holding a lock and it doesn't need to generate as much fencing code.
The other advantage is that tree nodes can be moved (not yet
implemented).
- GFP flags are now parameters to calls which may need to allocate
memory. The radix tree forced users to decide what the allocation
flags would be at creation time. It's much clearer to specify them
at allocation time. I know the MM people disapprove of the radix
tree using the top bits of the GFP flags for its own purpose, so
they'll like this aspect.
- Memory is not preloaded; we don't tie up dozens of pages on the
off chance that the slab allocator fails. Instead, we drop the lock,
allocate a new node and retry the operation.
- The xarray provides a conditional-replace operation. The radix tree
forces users to roll their own (and at least four have).
- Iterators now take a 'max' parameter. That simplifies many users and
will reduce the amount of iteration done.
- Iteration can proceed backwards. We only have one user for this, but
since it's called as part of the pagefault readahead algorithm, that
seemed worth mentioning. Not fully implemented.
- RCU-protected pointers are not exposed as part of the API. There are
some fun bugs where the page cache forgets to use rcu_dereference()
in the current codebase.
- Any function which wants it can now call the update_node() callback.
There were a few places missing that I noticed as part of this rewrite.
- Help the page cache to keep nrexceptional and nrpages up to date.
- Exceptional entries may now be BITS_PER_LONG-1 in size, rather than the
BITS_PER_LONG-2 that they had in the radix tree. That gives us the
extra bit we need to put huge page swap entries in the page cache.

The API comes in two parts, normal and advanced. The normal API takes
care of the locking and memory allocation for you. You can get the
value of a pointer by calling xa_load() and set the value of a pointer by
calling xa_store(). You can conditionally update the value of a pointer
by calling xa_replace(). Each pointer which isn't NULL can be tagged
with up to 3 bits of extra information, accessed through xa_get_tag(),
xa_set_tag() and xa_clear_tag(). You can copy batches of pointers out
of the array by calling xa_get_entries() or xa_get_tagged(). You can
iterate over pointers in the array by calling xa_find(), xa_next()
or xa_for_each().

The advanced API allows users to build their own operations. You have
to take care of your own locking and handle memory allocation failures.
Most of the advanced operations are based around the xa_state which
keeps state between sub-operations. Read the xarray.h header file for
more information on the advanced API, and see the implementation of the
normal API for examples of how to use the advanced API.

Those familiar with the radix tree may notice certain similarities between
the implementation of the xarray and the radix tree. That's entirely
intentional, but the implementation will certainly adapt in the future.
For example, one of the impediments I see to using xarrays instead of
kvmalloced arrays is memory consumption, so I have a couple of ideas to
reduce memory usage for smaller arrays.

I have already reimplementated the IDR and the IDA based on the xarray.
They are roughly the same complexity as they were when implemented on
top of the radix tree (although much less intertwined).

When converting code from the radix tree to the xarray, the biggest thing
to bear in mind is that 'store' overwrites anything which happens to be
in the xarray. Just like the assignment operator. The equivalent to
the insert operation is to replace NULL with the new value.

A quick reference guide to help when converting radix tree code.
Functions which start 'xas' are XA_ADVANCED functions.

radix_tree_empty xa_empty
__radix_tree_create xas_create
__radix_tree_insert xas_store
radix_tree_insert(x) xa_replace(NULL, x)
__radix_tree_lookup xas_load
radix_tree_lookup xa_load
radix_tree_lookup_slot xas_load
__radix_tree_replace xas_store
radix_tree_iter_replace xas_store
radix_tree_replace_slot xas_store
__radix_tree_delete_node xas_delete_node
radix_tree_delete_item xa_replace
radix_tree_delete xa_store(NULL)
radix_tree_clear_tags xas_clear_tags
radix_tree_gang_lookup xa_get_entries
radix_tree_gang_lookup_slot xas_find (*1)
radix_tree_preload (*3)
radix_tree_maybe_preload (*3)
radix_tree_tag_set xa_set_tag
radix_tree_tag_clear xa_clear_tag
radix_tree_tag_get xa_get_tag
radix_tree_iter_tag_set xas_set_tag
radix_tree_gang_lookup_tag xa_get_tagged
radix_tree_gang_lookup_tag_slot xas_load (*2)
radix_tree_tagged xa_tagged
radix_tree_preload_end xas_preload_end
radix_tree_split_preload (*3)
radix_tree_split xas_split (*4)
radix_tree_join xa_replace

(*1) All three users of radix_tree_gang_lookup_slot() are using it to
ensure that there are no entries in a given range.
(*2) The one radix_tree_gang_lookup_tag_slot user should be using a
radix_tree_iter loop. It can use an xas_for_each() loop, or even an
xa_for_each() loop.
(*3) I don't think we're going to need a preallocation API. If we do
end up needing one, I have a plan that doesn't involve per-cpu
preallocation pools.
(*4) Not yet implemented


Matthew Wilcox (2):
Add XArray
XArray: Convert IDR and add test suite

include/linux/idr.h | 136 +--
include/linux/xarray.h | 764 ++++++++++++++
lib/Makefile | 3 +-
lib/idr.c | 503 +++++----
lib/radix-tree.c | 169 +--
lib/xarray.c | 1305 ++++++++++++++++++++++++
tools/include/asm/bug.h | 4 +
tools/include/linux/kernel.h | 1 +
tools/include/linux/spinlock.h | 7 +-
tools/testing/radix-tree/.gitignore | 4 +-
tools/testing/radix-tree/Makefile | 34 +-
tools/testing/radix-tree/{idr-test.c => idr.c} | 37 +-
tools/testing/radix-tree/linux.c | 2 +-
tools/testing/radix-tree/linux/radix-tree.h | 5 -
tools/testing/radix-tree/linux/rcupdate.h | 2 +
tools/testing/radix-tree/linux/xarray.h | 1 +
tools/testing/radix-tree/test.c | 59 +-
tools/testing/radix-tree/test.h | 48 +-
tools/testing/radix-tree/xarray.c | 241 +++++
19 files changed, 2783 insertions(+), 542 deletions(-)
create mode 100644 include/linux/xarray.h
create mode 100644 lib/xarray.c
rename tools/testing/radix-tree/{idr-test.c => idr.c} (91%)
create mode 100644 tools/testing/radix-tree/linux/xarray.h
create mode 100644 tools/testing/radix-tree/xarray.c

--
2.11.0