Re: [PATCH v5 3/6] iommu/iova: Extend rbtree node caching

From: tn
Date: Fri Sep 22 2017 - 12:51:20 EST

On 22.09.2017 18:21, Tomasz Nowicki wrote:
Hi Robin,

On 21.09.2017 17:52, Robin Murphy wrote:
The cached node mechanism provides a significant performance benefit for
allocations using a 32-bit DMA mask, but in the case of non-PCI devices
or where the 32-bit space is full, the loss of this benefit can be
significant - on large systems there can be many thousands of entries in
the tree, such that walking all the way down to find free space every
time becomes increasingly awful.

Maintain a similar cached node for the whole IOVA space as a superset of
the 32-bit space so that performance can remain much more consistent.

Inspired by work by Zhen Lei <thunder.leizhen@xxxxxxxxxx>.

Tested-by: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>
Tested-by: Zhen Lei <thunder.leizhen@xxxxxxxxxx>
Tested-by: Nate Watterson <nwatters@xxxxxxxxxxxxxx>
Signed-off-by: Robin Murphy <robin.murphy@xxxxxxx>

v5: Fixed __cached_rbnode_delete_update() logic to update both nodes
ÂÂÂÂ when necessary

 drivers/iommu/iova.c | 60 ++++++++++++++++++++++++----------------------------
 include/linux/iova.h | 3 ++-
 2 files changed, 30 insertions(+), 33 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 20be9a8b3188..c6f5a22f8d20 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -48,6 +48,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
ÂÂÂÂÂ spin_lock_init(&iovad->iova_rbtree_lock);
ÂÂÂÂÂ iovad->rbroot = RB_ROOT;
+ÂÂÂ iovad->cached_node = NULL;
ÂÂÂÂÂ iovad->cached32_node = NULL;
ÂÂÂÂÂ iovad->granule = granule;
ÂÂÂÂÂ iovad->start_pfn = start_pfn;
@@ -110,48 +111,44 @@ EXPORT_SYMBOL_GPL(init_iova_flush_queue);
 static struct rb_node *
 __get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
-ÂÂÂ if ((*limit_pfn > iovad->dma_32bit_pfn) ||
-ÂÂÂÂÂÂÂ (iovad->cached32_node == NULL))
+ÂÂÂ struct rb_node *cached_node = NULL;
+ÂÂÂ struct iova *curr_iova;
+ÂÂÂ if (*limit_pfn <= iovad->dma_32bit_pfn)
+ÂÂÂÂÂÂÂ cached_node = iovad->cached32_node;
+ÂÂÂ if (!cached_node)
+ÂÂÂÂÂÂÂ cached_node = iovad->cached_node;
+ÂÂÂ if (!cached_node)
ÂÂÂÂÂÂÂÂÂ return rb_last(&iovad->rbroot);
-ÂÂÂ else {
-ÂÂÂÂÂÂÂ struct rb_node *prev_node = rb_prev(iovad->cached32_node);
-ÂÂÂÂÂÂÂ struct iova *curr_iova =
-ÂÂÂÂÂÂÂÂÂÂÂ rb_entry(iovad->cached32_node, struct iova, node);
-ÂÂÂÂÂÂÂ *limit_pfn = curr_iova->pfn_lo;
-ÂÂÂÂÂÂÂ return prev_node;
-ÂÂÂ }
+ÂÂÂ curr_iova = rb_entry(cached_node, struct iova, node);
+ÂÂÂ *limit_pfn = min(*limit_pfn, curr_iova->pfn_lo);

I guess this it the fix for stale pointer in iovad->cached32_node from previous series but I think this is something more.

With this min() here we have real control over the limit_pfn we would like to allocate at most. In other works, without your series two subsequent calls can give us:
iova (ffff) = alloc_iova_fast(iovad, 1, DMA_BIT_MASK(32) >> shift);

iova (fffe)= alloc_iova_fast(iovad, 1, DMA_BIT_MASK(16) >> shift);

We do not see this since nobody uses limit_pfn below DMA_BIT_MASK(32) now. That might be done intentionally so I ask for your opinion.

Also, with your patch and two the same alloc_iova_fast() calls in iteration may get 32-bit space full much faster. Please correct me if I messed things up.

After few more thoughts, I think this is unrealistic case.

In any case, please consider below fix:
static void
-__cached_rbnode_insert_update(struct iova_domain *iovad,
- unsigned long limit_pfn, struct iova *new)
+__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
- if (limit_pfn != iovad->dma_32bit_pfn)
- return;
- iovad->cached32_node = &new->node;
+ if (new->pfn_hi == iovad->dma_32bit_pfn)
+ iovad->cached32_node = &new->node;
+ else
+ iovad->cached_node = &new->node;