Re: [PATCH v4 00/73] XArray version 4
From: Matthew Wilcox
Date: Wed Dec 06 2017 - 19:13:54 EST
On Wed, Dec 06, 2017 at 04:58:29PM -0700, Ross Zwisler wrote:
> Maybe I missed this from a previous version, but can you explain the
> motivation for replacing the radix tree with an xarray? (I think this should
> probably still be part of the cover letter?) Do we have a performance problem
> we need to solve? A code complexity issue we need to solve? Something else?
Sure! Something else I screwed up in the v4 announcement ... I'll
need it again for v5, so here's a quick update of the v1 announcement's
justification:
I wrote the xarray to replace the radix tree with a better API based
on observing how programmers are currently using the radix tree, and
on how (and why) they aren't. Conceptually, an xarray is an array of
ULONG_MAX pointers which is initially full of NULL pointers.
Improvements the xarray has over the radix tree:
- The radix tree provides operations like other trees do: 'insert' and
'delete'. But what users really want is an automatically resizing
array, and so it makes more sense to give users an API that is like
an array -- 'load' and 'store'.
- Locking is part of the API. This simplifies a lot of users who
formerly had to manage their own locking just for the radix tree.
It also improves code generation as we can now tell RCU that we're
holding a lock and it doesn't need to generate as much fencing code.
The other advantage is that tree nodes can be moved (not yet
implemented).
- GFP flags are now parameters to calls which may need to allocate
memory. The radix tree forced users to decide what the allocation
flags would be at creation time. It's much clearer to specify them
at allocation time. I know the MM people disapprove of the radix
tree using the top bits of the GFP flags for its own purpose, so
they'll like this aspect.
- Memory is not preloaded; we don't tie up dozens of pages on the
off chance that the slab allocator fails. Instead, we drop the lock,
allocate a new node and retry the operation.
- The xarray provides a conditional-replace operation. The radix tree
forces users to roll their own (and at least four have).
- Iterators now take a 'max' parameter. That simplifies many users and
will reduce the amount of iteration done.
- Iteration can proceed backwards. We only have one user for this, but
since it's called as part of the pagefault readahead algorithm, that
seemed worth mentioning.
- RCU-protected pointers are not exposed as part of the API. There are
some fun bugs where the page cache forgets to use rcu_dereference()
in the current codebase.
- Any function which wants it can now call the update_node() callback.
There were a few places missing that I noticed as part of this rewrite.
- Exceptional entries may now be BITS_PER_LONG-1 in size, rather than the
BITS_PER_LONG-2 that they had in the radix tree. That gives us the
extra bit we need to put huge page swap entries in the page cache.
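To make the last point concrete, here is a minimal userspace sketch of how
an integer can be packed into a pointer-sized entry while giving up only
one bit. The encoding (low bit set means "this is a value, not a pointer")
and the toy_* names are assumptions made for illustration; they are not
taken from the patch set.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helpers, not the xarray API.  Assumption: real pointers
 * stored in the array are at least 2-byte aligned, so bit 0 is free to
 * mark an entry as an integer value.
 */
static inline void *toy_mk_value(unsigned long v)
{
	/* One bit is spent on the marker, so v gets BITS_PER_LONG - 1 bits. */
	assert(v <= (ULONG_MAX >> 1));
	return (void *)(uintptr_t)((v << 1) | 1);
}

static inline bool toy_is_value(const void *entry)
{
	return ((uintptr_t)entry & 1) != 0;
}

static inline unsigned long toy_to_value(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 1);
}
```

With this layout an ordinary aligned pointer never tests as a value, and
any integer up to ULONG_MAX >> 1 survives the round trip.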
The API comes in two parts, normal and advanced. The normal API takes
care of the locking and memory allocation for you. You can get the
value of a pointer by calling xa_load() and set the value of a pointer by
calling xa_store(). You can conditionally update the value of a pointer
by calling xa_cmpxchg(). Each pointer which isn't NULL can be tagged
with up to 3 bits of extra information, accessed through xa_get_tag(),
xa_set_tag() and xa_clear_tag(). You can copy batches of pointers out
of the array by calling xa_get_entries() or xa_get_tagged(). You can
iterate over pointers in the array by calling xa_find(), xa_find_after()
or xa_for_each().
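As a rough illustration of the load/store/tag model described above, here
is a toy fixed-size array in userspace C. It does no locking, no memory
allocation and no resizing, and the toy_* names, the return value of
toy_store() and the rule that storing NULL clears the tags are all
assumptions of this sketch rather than statements about the real functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SIZE 16	/* the real structure resizes itself; this toy does not */
#define TOY_TAGS 3	/* "up to 3 bits of extra information" per entry */

struct toy_array {
	void *slot[TOY_SIZE];
	unsigned char tags[TOY_SIZE];	/* one bit per tag */
};

/* Out-of-range indices simply read as NULL, like an empty slot. */
static void *toy_load(const struct toy_array *a, unsigned long index)
{
	return index < TOY_SIZE ? a->slot[index] : NULL;
}

/* Overwrites whatever is there and returns the previous entry. */
static void *toy_store(struct toy_array *a, unsigned long index, void *entry)
{
	void *old;

	assert(index < TOY_SIZE);
	old = a->slot[index];
	a->slot[index] = entry;
	if (!entry)
		a->tags[index] = 0;	/* a NULL entry carries no tags */
	return old;
}

/* Only non-NULL entries can be tagged; tagging an empty slot is a no-op. */
static void toy_set_tag(struct toy_array *a, unsigned long index, unsigned int tag)
{
	assert(tag < TOY_TAGS);
	if (index < TOY_SIZE && a->slot[index])
		a->tags[index] |= 1u << tag;
}

static void toy_clear_tag(struct toy_array *a, unsigned long index, unsigned int tag)
{
	assert(tag < TOY_TAGS);
	if (index < TOY_SIZE)
		a->tags[index] &= ~(1u << tag);
}

static bool toy_get_tag(const struct toy_array *a, unsigned long index, unsigned int tag)
{
	assert(tag < TOY_TAGS);
	return index < TOY_SIZE && (a->tags[index] & (1u << tag)) != 0;
}
```

The point of the sketch is only the shape of the interface: callers think
in terms of indices and entries, never in terms of tree nodes or slots.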
The advanced API allows users to build their own operations. You have
to take care of your own locking and handle memory allocation failures.
Most of the advanced operations are based around the xa_state which
keeps state between sub-operations. Read the xarray.h header file for
more information on the advanced API, and see the implementation of the
normal API for examples of how to use the advanced API.
Those familiar with the radix tree may notice certain similarities between
the implementation of the xarray and the radix tree. That's entirely
intentional, but the implementation will certainly adapt in the future.
For example, one of the impediments I see to using xarrays instead of
kvmalloced arrays is memory consumption, so I have a couple of ideas to
reduce memory usage for smaller arrays.
I have reimplemented the IDR and the IDA based on the xarray. They are
roughly the same complexity as they were when implemented on top of the
radix tree (although much less intertwined).
When converting code from the radix tree to the xarray, the biggest thing
to bear in mind is that 'store' overwrites anything which happens to be
in the xarray. Just like the assignment operator. The equivalent to
the insert operation is to replace NULL with the new value.
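A small userspace sketch of that equivalence, using a plain array of
pointers in place of the real data structure. toy_cmpxchg() and
toy_insert() are hypothetical names, locking is not modelled, and the
-EEXIST return simply mirrors what radix_tree_insert() reports when the
index is already occupied.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_SIZE 16	/* fixed size, for illustration only */

/*
 * Replace the entry at index with new_entry only if it currently equals
 * old.  Returns the entry that was found, so the caller can tell whether
 * the exchange happened by comparing the result with old.
 */
static void *toy_cmpxchg(void **slots, unsigned long index,
			 void *old, void *new_entry)
{
	void *cur;

	assert(index < TOY_SIZE);
	cur = slots[index];
	if (cur == old)
		slots[index] = new_entry;
	return cur;
}

/* 'insert' is then just "replace NULL with the new value". */
static int toy_insert(void **slots, unsigned long index, void *entry)
{
	return toy_cmpxchg(slots, index, NULL, entry) ? -EEXIST : 0;
}
```

A plain store at the same index would have silently replaced the first
entry; the conditional form leaves it in place and reports the collision.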
A quick reference guide to help when converting radix tree code.
Functions which start 'xas' are XA_ADVANCED functions.
INIT_RADIX_TREE                  xa_init
radix_tree_empty                 xa_empty
__radix_tree_create              xas_create
__radix_tree_insert              xas_store
radix_tree_insert(x)             xa_cmpxchg(NULL, x)
__radix_tree_lookup              xas_load
radix_tree_lookup                xa_load
radix_tree_lookup_slot           xas_load
__radix_tree_replace             xas_store
radix_tree_iter_replace          xas_store
radix_tree_replace_slot          xas_store
__radix_tree_delete_node         xas_store
radix_tree_delete_item           xa_cmpxchg
radix_tree_delete                xa_erase
radix_tree_clear_tags            xas_init_tags
radix_tree_gang_lookup           xa_get_entries
radix_tree_gang_lookup_slot      xas_find (*1)
radix_tree_preload               (*3)
radix_tree_maybe_preload         (*3)
radix_tree_tag_set               xa_set_tag
radix_tree_tag_clear             xa_clear_tag
radix_tree_tag_get               xa_get_tag
radix_tree_iter_tag_set          xas_set_tag
radix_tree_gang_lookup_tag       xa_get_tagged
radix_tree_gang_lookup_tag_slot  xas_load (*2)
radix_tree_tagged                xa_tagged
radix_tree_preload_end           (*3)
radix_tree_split_preload         (*3)
radix_tree_split                 xas_split (*4)
radix_tree_join                  xas_store
(*1) All three users of radix_tree_gang_lookup_slot() are using it to
ensure that there are no entries in a given range.
(*2) The one radix_tree_gang_lookup_tag_slot user should be using a
radix_tree_iter loop. It can use an xas_for_each() loop, or even an
xa_for_each() loop.
(*3) I don't think we're going to need a preallocation API. If we do
end up needing one, I have a plan that doesn't involve per-cpu
preallocation pools.
(*4) Not yet implemented
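
To illustrate footnote (*1) and the 'max' parameter on iterators, here is
a userspace sketch of a bounded find and the "is this range empty?" check
built on top of it. toy_find() and toy_range_empty() are hypothetical
names operating on a plain fixed-size array; they only show the shape of
the idea and are not the xas_find() interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SIZE 16	/* fixed size, for illustration only */

/*
 * Return the first non-NULL entry at an index >= *indexp and <= max,
 * updating *indexp to where it was found.  Returns NULL if there is none.
 * Because the search stops at max, no work is done beyond the range the
 * caller cares about.
 */
static void *toy_find(void *const *slots, unsigned long *indexp,
		      unsigned long max)
{
	unsigned long i;

	for (i = *indexp; i <= max && i < TOY_SIZE; i++) {
		if (slots[i]) {
			*indexp = i;
			return slots[i];
		}
	}
	return NULL;
}

/* True if no entry is present anywhere in [first, last]. */
static bool toy_range_empty(void *const *slots, unsigned long first,
			    unsigned long last)
{
	unsigned long index = first;

	assert(first <= last);
	return toy_find(slots, &index, last) == NULL;
}
```

A caller that only wants to know whether a range is unoccupied needs a
single bounded find rather than a gang lookup into a temporary buffer.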