[Userspace RCU] Introducing a new data structure: Fractal Trie

From: Mathieu Desnoyers

Date: Wed Apr 01 2026 - 13:50:06 EST


Hi!

I've implemented a new RCU trie data structure, and would welcome
your feedback while it is still in development.

It currently sits in a development branch of my private
userspace-rcu repository. It targets userspace applications, but
could be ported to the Linux kernel if there is a use-case for it.

https://github.com/compudj/userspace-rcu-dev/tree/fractal-trie-dev

I started working on this project around 2012, and have recently
found time to pursue this work again. (Many thanks to ISC.org
for funding this!)

Here are a few major use-cases that I think are a good fit:

- DNS resolvers,
- IP routing tables,
- real-time application data structure lookups with strict
real-time bounds (wait-free, with a guaranteed bounded number of
cache-line loads per key byte).

I'm pasting the API header's top comment here as a summary of the
feature set:

/*
* urcu/fractal-trie.h
*
* Userspace RCU library - Fractal Trie
*
* A concurrent, RCU-protected ordered trie mapping variable or
* fixed-length byte keys to user-defined nodes. Keys are opaque
* byte sequences with no reserved or sentinel values; unlike
* tries that rely on NUL or other terminal characters, any byte
* value may appear at any position in a key. Lookups and
* traversals are wait-free under the RCU read-side lock and can
* proceed concurrently with mutations. The internal structure is
* acyclic by construction: no sequence of concurrent updates can
* introduce a cycle, ensuring that all lookups and traversals
* complete in bounded time. Supports exact lookup, partial
* (prefix) match, ordered iteration, duplicate key chains, and
* range queries (<=, >=, <, >). A longest-match lookup reports
* the deepest trie position matching a prefix of the input key,
* even if that position has no external node.
*
* Performance characteristics:
*
* - Lookup: At most 2 cache-line accesses per key byte.
* - Ordered traversal: At most 3 cache-line accesses per node.
*
* Internal node configurations:
*
* Internal nodes self-adapt to the key population using several
* configurations (linear, 1D pool, 2D pool, pigeon), each with
* a different indexing strategy suited to its child density.
* The appropriate configuration is chosen automatically based
* on the number of children. Node sizes are powers of 2 between
* 16 bytes and 2048 bytes on 64-bit architectures (1024 bytes
* on 32-bit). Mutations use in-place updates when possible to
* minimize node recompaction, and hysteresis at size thresholds
* prevents repeated recompaction when the child count oscillates
* near a boundary.
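As an aside, the hysteresis idea above can be sketched in a few lines. The
threshold values and function names here are my own illustrative assumptions,
not the library's actual constants: the point is only that the grow and shrink
thresholds differ, so a child count oscillating around the boundary does not
recompact on every mutation.

```c
#include <stdbool.h>

/* Hypothetical thresholds, for illustration only. */
#define GROW_AT		16	/* grow to the larger configuration above this */
#define SHRINK_AT	12	/* shrink back only below this (SHRINK_AT < GROW_AT) */

static bool needs_recompact(bool is_large, unsigned int nr_children)
{
	if (!is_large)
		return nr_children > GROW_AT;	/* grow */
	else
		return nr_children < SHRINK_AT;	/* shrink */
}
```

With these assumed values, a node whose child count bounces between 16 and 17
recompacts once on the way up and then stays in the larger configuration.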
*
* The 1D and 2D pool configurations partition the 8-bit key
* byte space using one or two bit positions respectively,
* distributing children across sub-nodes. This provides a
* range of intermediate node sizes between the compact linear
* configuration and the full 256-entry pigeon configuration,
* allowing memory-efficient representation of medium-density
* populations without requiring the full pigeon footprint.
* The bit positions are selected to minimize the maximum
* sub-node population, using a minimax criterion. The
* transition thresholds and worst-case sub-node sizes were
* empirically validated by brute-force enumeration of optimal
* bit selections across millions of random populations.
* Alternative approaches such as Judy use a population bitmap
* with a dense child array, which requires recompacting the
* array on every insertion or removal. This is incompatible
* with wait-free RCU lookups, since a reader could observe a
* partially recompacted array. The pool approach avoids this
* by using fixed index positions derived from key bits,
* allowing children to be added or removed with single-pointer
* updates visible atomically to concurrent readers.
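To make the minimax criterion concrete, here is a self-contained sketch (not
the library's code; the function name and 1D-pool framing are assumptions) of
picking the single key-byte bit for a 1D pool split: for each candidate bit,
count how the children would divide between the two sub-nodes, and keep the
bit whose larger sub-node is smallest.

```c
#include <stddef.h>

/*
 * Return the bit position (0..7) of the child key bytes that
 * minimizes the maximum sub-node population; optionally report
 * that worst-case population via *worst.
 */
static unsigned int pick_split_bit(const unsigned char *keys, size_t nr,
				   unsigned int *worst)
{
	unsigned int best_bit = 0, best_max = (unsigned int)nr + 1;

	for (unsigned int bit = 0; bit < 8; bit++) {
		unsigned int set = 0;

		for (size_t i = 0; i < nr; i++)
			set += (keys[i] >> bit) & 1;	/* children landing in the "1" sub-node */
		unsigned int clear = (unsigned int)nr - set;
		unsigned int max = set > clear ? set : clear;

		if (max < best_max) {
			best_max = max;
			best_bit = bit;
		}
	}
	if (worst)
		*worst = best_max;
	return best_bit;
}
```

For the population { 0x00, 0x80 }, only bit 7 splits the children 1/1; every
other bit leaves both in the same sub-node, so the minimax choice is bit 7.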
*
* Node type and configuration are encoded in the low bits of
* child pointers (tagged pointers), so determining a node's
* layout during lookup requires no extra memory access.
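The tagged-pointer scheme can be illustrated as follows. The enum values, tag
width, and helper names below are assumptions for the sketch, not the actual
encoding: the mechanism relies only on node alignment (nodes are at least 16
bytes) leaving the low pointer bits free.

```c
#include <stdint.h>

/* Hypothetical configuration tags carried in the low pointer bits. */
enum node_config { LINEAR = 0, POOL_1D = 1, POOL_2D = 2, PIGEON = 3 };

#define TAG_MASK	0xfUL	/* 16-byte minimum alignment frees 4 low bits */

static inline uintptr_t tag_ptr(void *node, enum node_config cfg)
{
	return (uintptr_t)node | (uintptr_t)cfg;
}

static inline void *untag_ptr(uintptr_t tagged)
{
	return (void *)(tagged & ~TAG_MASK);
}

static inline enum node_config ptr_config(uintptr_t tagged)
{
	return (enum node_config)(tagged & TAG_MASK);
}
```

A lookup that follows a child pointer thus already knows the child's layout
before dereferencing it, which is what keeps the per-byte cache-line bound low.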
*
* The 2D pool and pigeon configurations use a 32-byte bitmap
* to locate populated children via bit scanning, reducing the
* number of cache-line accesses needed for ordered traversal.
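A 32-byte bitmap is 256 bits, one per possible key byte, so ordered traversal
reduces to bit scanning. The following self-contained sketch (the function
name is illustrative, and it uses the GCC/clang __builtin_ctzll builtin)
returns the smallest populated key byte greater than or equal to a starting
value, or -1 if none remains:

```c
#include <stdint.h>

static int bitmap_next(const uint64_t bm[4], unsigned int from)
{
	for (unsigned int w = from / 64; w < 4; w++) {
		uint64_t word = bm[w];

		if (w == from / 64)
			word &= ~0ULL << (from % 64);	/* mask bits below 'from' */
		if (word)
			return (int)(w * 64 + (unsigned int)__builtin_ctzll(word));
	}
	return -1;
}
```

Scanning a word at a time means a sparse pigeon node is traversed with a
handful of cache-line accesses rather than one probe per possible key byte.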
*
* Memory layout:
*
* Per-node metadata is stored in a separate page via a strided
* allocator and reached by pointer offset from the node address,
* keeping metadata out of the node's cache lines. This avoids
* padding overhead within nodes and ensures that lookups and
* traversals only touch the data they need. Internal nodes are
* reclaimed via RCU (call_rcu), so readers never access freed
* memory even for the trie's own internal structure.
*
* This provides memory efficiency comparable to adaptive radix
* tree schemes without requiring user tuning or configuration.
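The "metadata reached by pointer offset" idea can be sketched with a toy
layout. All sizes, names, and the slot arithmetic below are assumptions for
illustration, not the strided allocator's real geometry: the point is that
the metadata address is computable from the node address alone, so lookups
never pull metadata into the node's cache lines.

```c
#include <stdint.h>

#define PAGE_SIZE	4096UL
#define NODE_STRIDE	64UL	/* assumed node slot size */
#define NR_SLOTS	56UL	/* node slots per page; the tail holds metadata */

struct node_meta { uint32_t refcount; uint32_t flags; };	/* hypothetical */

static struct node_meta *node_to_meta(void *node)
{
	uintptr_t page = (uintptr_t)node & ~(PAGE_SIZE - 1);
	uintptr_t slot = ((uintptr_t)node - page) / NODE_STRIDE;
	struct node_meta *metas =
		(struct node_meta *)(page + NR_SLOTS * NODE_STRIDE);

	return &metas[slot];	/* parallel array at the end of the page */
}
```

With this assumed geometry, 56 slots of 64 bytes leave 512 bytes of page tail,
enough for the 56-entry metadata array, and node_to_meta() is pure address
arithmetic.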
*
* Graft, graft-swap, and detach:
*
* The graft, graft-swap, and detach operations move entire
* sub-tries between trie instances within the same group. Each
* operation is a single pointer store visible atomically to
* concurrent RCU readers: a reader sees either the complete
* sub-trie or nothing, never a partial state.
*
* - Graft (cds_ft_graft) attaches the content of a source trie
* at a key position in a destination trie. The destination
* must have no content at or below the graft point. On
* success the source trie becomes empty and can be reused or
* destroyed.
*
* - Graft-swap (cds_ft_graft_swap) exchanges the content at a
* key position in a destination trie with the content of
* another trie. Unlike plain graft, the destination does not
* need to be empty at the graft point: any pre-existing
* content is moved into the swap trie. Concurrent readers
* see either the old content or the new content, never an
* empty intermediate state.
*
* - Detach (cds_ft_detach) removes the sub-structure rooted at
* a key position and returns it as a new, independent trie
* instance whose key lengths are relative to the detach
* point.
*
* Root-level operations (key_len 0) work with both fixed-length
* and variable-length key groups. Non-root operations (key_len
* > 0) require a variable-length key group
* (CDS_FT_LEN_VARIABLE): in a fixed-length group every key
* must equal the group's fixed length, so a shorter prefix
* needed to address an interior graft or detach point cannot be
* expressed, and the call returns
* CDS_FT_STATUS_INVALID_ARGUMENT_ERROR.
*
* Together, these operations serve as the transplant, exchange,
* and split primitives for bulk operations. A typical bulk-load
* pattern populates a staging trie offline — with no RCU or
* locking overhead — and then grafts it into the live trie in
* a single O(1) step, keeping the writer critical section
* minimal regardless of the number of nodes being moved. A
* complementary bulk-removal pattern detaches (or graft-swaps)
* a sub-trie in O(1), waits for a single grace period, and
* then drains the detached trie locally, reducing the cost
* from one grace period per node to one grace period total.
*
* Iterator Lifecycle and RCU Locking:
*
* The cds_ft_iter object can operate in two path modes, selected via
* cds_ft_iter_set_path_mode():
*
* CDS_FT_ITER_PATH_CACHED (default):
*
* The iterator caches the traversal path (RCU protected pointers) to
* accelerate subsequent sequential operations like cds_ft_next(),
* cds_ft_prev(), or cds_ft_remove().
*
* Because of this caching, the RCU read-side lock must be held
* CONTINUOUSLY between the operation that populates the iterator
* (e.g., cds_ft_lookup) and the operation that consumes it. If the
* RCU read-side lock is dropped, the cached path may point to freed
* memory.
*
* If you need to drop the RCU read-side lock between operations, you
* MUST invalidate the iterator's cached path before reusing it. This
* is done by calling:
* - cds_ft_iter_invalidate_path()
*
* Calling this function clears the internal path state while keeping
* your key intact. The next operation (e.g., cds_ft_remove) will
* automatically fall back to a safe, fresh top-down traversal.
*
* Example A (Continuous Lock - Fast):
* rcu_read_lock();
* cds_ft_lookup(ft, iter);
* cds_ft_remove(ft, iter, node); // Uses cached path O(1)
* rcu_read_unlock();
*
* Example B (Dropped Lock - CDS_FT_ITER_PATH_CACHED only):
* rcu_read_lock();
* cds_ft_lookup(ft, iter);
* rcu_read_unlock();
* // ... time passes, lock is dropped ...
*
* lock(&writer_mutex);
* cds_ft_iter_invalidate_path(iter); // Flush the stale RCU cache
* cds_ft_remove(ft, iter, node); // Safe: Fresh traversal protected by writer lock
* unlock(&writer_mutex);
*
* Note: CDS_FT_ITER_PATH_UNCACHED iterators do not require
* cds_ft_iter_invalidate_path() — the path is discarded
* automatically after each operation.
*
* Debug validation (URCU_FRACTAL_TRIE_DEBUG_PATH):
*
* Building with URCU_FRACTAL_TRIE_DEBUG_PATH defined enables run-time
* detection of stale cached paths. Each path population records an RCU
* grace-period snapshot (via the flavor's
* update_start_poll_synchronize_rcu); each path consumption polls it
* (via update_poll_state_synchronize_rcu). If a full grace period has
* elapsed since the path was populated, the cached pointers may
* reference freed memory — the program aborts with a diagnostic.
* Since struct cds_ft_iter is opaque, this option does not affect the
* application ABI — only the library needs to be rebuilt.
*
* CDS_FT_ITER_PATH_UNCACHED:
*
* The iterator automatically discards the traversal path after each
* operation returns. Every subsequent operation performs a fresh
* top-down traversal from the current key. The RCU read-side lock
* only needs to be held during each individual operation and while
* accessing the returned node — it may be dropped between operations.
*
* This mode is suited for iteration patterns where the caller needs
* to drop the RCU read-side lock between steps (e.g. to perform
* blocking work or acquire other locks).
*
* Example C (Uncached mode, per-step locking with refcount):
* cds_ft_iter_set_path_mode(iter, CDS_FT_ITER_PATH_UNCACHED);
* rcu_read_lock();
* cds_ft_for_each_rcu(ft, iter) {
* struct my_entry *e = cds_ft_entry(
* cds_ft_iter_node(iter),
* struct my_entry, ft_node);
* refcount_inc(&e->refcount);
* rcu_read_unlock();
*
* do_blocking_work(e);
* refcount_dec(&e->refcount);
*
* rcu_read_lock();
* }
* rcu_read_unlock();
*/

Your feedback is welcome!

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com