RE: [PATCH] radix-tree: get_slot_offset() returns invalid offset when parent is NULL

From: Matthew Wilcox
Date: Fri Oct 13 2017 - 11:40:55 EST


From: Wei Yang [mailto:richard.weiyang@xxxxxxxxx]
> On Wed, Oct 11, 2017 at 04:39:40PM -0700, Andrew Morton wrote:
> >On Wed, 11 Oct 2017 10:33:39 +0800 Wei Yang
> >> BTW, I have another question on the implementation of radix tree.
> >>
> >> #Question: why 6 is one of the choice of RADIX_TREE_MAP_SHIFT
> >>
> >> Currently, the shift of the tree is defined as below.
> >>
> >> #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
> >>
> >> The index in the tree is with type unsigned long, which means the size is
> >> 32bits or 64bits.
> >>
> >> 32 % 4 = 0, 64 % 4 = 0
> >>
> >> 32 % 6 = 2, 64 % 6 = 4
> >>
> >> This means when the tree shift is 6 and represents the max index, there
> would
> >> be some unused slots at the top level.
> >>
> >> So why we chose 6 instead of a 2^N?
> >>
> >> The original commit is 139e561660fe11e0fc35e142a800df3dd7d03e9d,
> which I don't
> >> see some reason for this choice.
> >>
> >> If possible, I suggest to change this value to a power of 2. Or maybe I missed
> >> some critical reason behind this.
> >>
> >
> >I'm not sure I really see the problem. 1<<6 is still a power of 2.
> >radix_tree_split_preload() is doing mod and div on shift distances
> >which hurts my brain a bit. But CONFIG_BASE_SMALL=0 is what most
> >people will be using so presumably the current code tests out OK.
> >Although simple memory wastage would be hard to detect. Matthew has
> >been here recently and perhaps can consider?
>
> Hmm, I may not state it clearly.
>
> What I mean is to set RADIX_TREE_MAP_SHIFT a power of 2, instead of the
> result
> of (1 << RADIX_TREE_MAP_SHIFT).
>
> Let me explain a simplified example. Assume the max index is 16 bits.
>
> Case 1, shift is 4
>
> Each 4 bit forms a group and exactly 4 groups.
>
> |0123|4567|0123|4567|
>
> Case 2, shift is 6
>
> Each 6 bit forms a group and left 3 groups with 2 empty slots at the
> beginning.
>
> | 0123|456701|234567|
>
> Radix tree is a little bit like the page table, which is divided into several
> levels and each level is referenced by a "slot" of bits.
>
> When the shift is a power of 2, the digit fits in exact "slots". This means
> there is no waste.
>
> When the shift is not, we will have several unused digits at the most
> significant part. This means several slot will never be used.
>
> This means on theory, in a tree with shift 6
>
> | 0123|456701|234567|
>
> We can store an index with value
>
> ffffff ffffff ffffff
>
> While actually the largest value a 16 bits index could represents
>
> __ffff ffffff ffffff
>
> Since radix tree will not allocate memory when there is no corresponding
> index, this will just leave several unused slot at the top level.

Yes, if we have a tree with 2^64 pointers in it, we'll have 48 unused pointers in the top level leaf node. That's not a huge percentage of waste.

> While I still prefer to fully utilize the whole tree instead of leave some
> never touch slot. And this can be achieved by set RADIX_TREE_MAP_SHIFT to a
> power of 2.
>
> So if there is no specific reason to use 6, I would like to form a patch to
> change it to 8.

The cost of that is in the slab allocator. With 64bit pointers and a 64-entry node, we use 576 bytes per node, fitting 7 nodes per page. With 256 entries per node, the node balloons to just over 2kB, and we get 3 nodes per pair of pages. This is massively less efficient than the inefficiency youâre complaining about.