Re: [PATCH 3/4] iova: limit_pfn-aware augmentation for log(n) 32-bit alloc

From: Rik van Riel

Date: Fri May 29 2026 - 13:47:47 EST

On Thu, 2026-05-28 at 17:46 +0100, Robin Murphy wrote:
> Hi Rik,
>
> I'm certainly interested to see this series, as improving the
> allocator
> has been on my to-do list for a very long time now. It might take me
> a
> while to page all the details back in, but some higher-level things
> stand out already...
>
> On 2026-05-18 7:05 pm, Rik van Riel wrote:
> > From: Rik van Riel <riel@xxxxxxxx>
> >
> > The augmented-rbtree port made __iova_search_free_gap O(log n) when
> > limit_pfn doesn't bind any candidate gap, but degrades to O(n) when
> > limit_pfn is small relative to the gaps in the tree. The classic
> > case
> > is a 32-bit DMA allocation on a domain dominated by 64-bit
> > allocations:
> > the augmented invariant __subtree_max_gap is satisfied everywhere
> > (the
> > huge gap between the highest 64-bit allocation and the IOVA_ANCHOR
> > dominates), so pruning never fires, and every node above limit_pfn
> > must
> > be visited and rejected before falling through to the 32-bit
> > region.
>
> The main reason the anchor node exists is to ensure that cached_node
> can
> always be a valid place to start an allocation walk - if we're no
> longer
>
We kind of still need the anchor node, as the highest thing
in the rbtree, so we can use that as bookkeeping for the
largest gap.

Without it, we would not know there was an enormous 64 bit
gap if the first iova allocation is made with a 32 bit
mask.

We could add special casing, but the anchor node might be
simpler and smaller.

> > Add a second augmented field that bounds the search by 32-bit-
> > clamped
> > gap size:
>
> I would think that if we still need a special case for this then we
> don't really have the right allocation algorithm - I recall getting
> as
> far as concluding that it would be hard to do with the generic
> interval
> tree,

You are right, we do not need this. Walking the rbtree
from the root can bring us to the right address range
in O(log n) time, without any need for additional
augmented data.

> > For 64-bit allocations the behaviour is unchanged. For 32-bit
> > allocations on mixed-DMA-mask domains the search is now O(log n).
>
> And what about all the 33 to 56-bit DMA masks? They're not all that
> uncommon and deserve some love too. FWIW I'd imagine a general
> algorithm
> should be something like:
>
> - while pfn_lo > limit_pfn && left->max_gap >= size, go left
> - while pfn_hi + size <= limit_pfn && right->max_gap >= size, go
> right
> - if sufficient space to right, insert right and stop.
> - if left->max_gap > size, go left, else go up and left
> - repeat

I have this implemented here now.

I can send out v3 if you would like to see that version,
but you are right that the maple tree may be better!

>
> (If anything that then leans towards maintaining a dummy node down at
> start_pfn, such that we could avoid a special case for allocating off
> the leftmost end of the tree...)
>
> I guess another question to ask these days is perhaps whether rbtree
> is
> still even the best choice at all, or if something newer like maple
> tree
> might be able to be even better?
>

Doing this allows the struct iova to shrink from
40 bytes down to 16 bytes. However the iova slab
cache seems to specify SLAB_HWCACHE_ALIGN, resulting
in 64 bytes used per iova.

If we drop SLAB_HWCACHE_ALIGN and assume 70-90%
full maple tree nodes (I assume many iova requests
are of same/similar size - let me know if that's
wrong), we are looking at a reduction of memory
use per iova from 64 bytes allocated, to around
44-60 bytes per iova on average.

This is with allocation becoming O(log n).

Maple tree may be the way to go.

Let me go test this code, and I'd be happy
to post this as v3, instead of the latest
augmented rbtree code :)

--
All Rights Reversed.