Re: [PATCH v5 3/6] iommu/iova: Extend rbtree node caching

From: Robin Murphy
Date: Fri Sep 22 2017 - 13:25:12 EST


On 22/09/17 17:21, Tomasz Nowicki wrote:
> Hi Robin,
>
> On 21.09.2017 17:52, Robin Murphy wrote:
>> The cached node mechanism provides a significant performance benefit for
>> allocations using a 32-bit DMA mask, but in the case of non-PCI devices
>> or where the 32-bit space is full, the loss of this benefit can be
>> significant - on large systems there can be many thousands of entries in
>> the tree, such that walking all the way down to find free space every
>> time becomes increasingly awful.
>>
>> Maintain a similar cached node for the whole IOVA space as a superset of
>> the 32-bit space so that performance can remain much more consistent.
>>
>> Inspired by work by Zhen Lei <thunder.leizhen@xxxxxxxxxx>.
>>
>> Tested-by: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>
>> Tested-by: Zhen Lei <thunder.leizhen@xxxxxxxxxx>
>> Tested-by: Nate Watterson <nwatters@xxxxxxxxxxxxxx>
>> Signed-off-by: Robin Murphy <robin.murphy@xxxxxxx>
>> ---
>>
>> v5: Fixed __cached_rbnode_delete_update() logic to update both nodes
>>     when necessary
>>
>>  drivers/iommu/iova.c | 60 ++++++++++++++++++++++++----------------------------
>>  include/linux/iova.h |  3 ++-
>>  2 files changed, 30 insertions(+), 33 deletions(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index 20be9a8b3188..c6f5a22f8d20 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -48,6 +48,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>>
>>      spin_lock_init(&iovad->iova_rbtree_lock);
>>      iovad->rbroot = RB_ROOT;
>> +    iovad->cached_node = NULL;
>>      iovad->cached32_node = NULL;
>>      iovad->granule = granule;
>>      iovad->start_pfn = start_pfn;
>> @@ -110,48 +111,44 @@ EXPORT_SYMBOL_GPL(init_iova_flush_queue);
>>  static struct rb_node *
>>  __get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
>>  {
>> -    if ((*limit_pfn > iovad->dma_32bit_pfn) ||
>> -        (iovad->cached32_node == NULL))
>> +    struct rb_node *cached_node = NULL;
>> +    struct iova *curr_iova;
>> +
>> +    if (*limit_pfn <= iovad->dma_32bit_pfn)
>> +        cached_node = iovad->cached32_node;
>> +    if (!cached_node)
>> +        cached_node = iovad->cached_node;
>> +    if (!cached_node)
>>          return rb_last(&iovad->rbroot);
>> -    else {
>> -        struct rb_node *prev_node = rb_prev(iovad->cached32_node);
>> -        struct iova *curr_iova =
>> -            rb_entry(iovad->cached32_node, struct iova, node);
>> -        *limit_pfn = curr_iova->pfn_lo;
>> -        return prev_node;
>> -    }
>> +
>> +    curr_iova = rb_entry(cached_node, struct iova, node);
>> +    *limit_pfn = min(*limit_pfn, curr_iova->pfn_lo);
>
> I guess this is the fix for the stale pointer in iovad->cached32_node
> from the previous series, but I think it is something more.
>
> With this min() here we have real control over the highest pfn an
> allocation is allowed to return. In other words, without your series
> two subsequent calls can give us:
> iova (ffff) = alloc_iova_fast(iovad, 1, DMA_BIT_MASK(32) >> shift);
>
> iova (fffe) = alloc_iova_fast(iovad, 1, DMA_BIT_MASK(16) >> shift);
>
> We do not see this since nobody uses a limit_pfn below DMA_BIT_MASK(32)
> now. That might be intentional, so I am asking for your opinion.

I realise it's not called out in the commit message, but this patch does
make one small change to how the existing 32-bit caching behaves when
freeing the topmost 32-bit IOVA. With the previous behaviour,
__cached_rbnode_delete_update() set cached32_node to NULL whenever the
rb_next(free) node lay above dma_32bit_pfn, meaning the next 32-bit
allocation would return early from here with rb_last(). limit_pfn could
therefore be updated unconditionally whenever a cached node existed, on
the expectation that it would only ever move downwards.
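
To make that concrete, here's a throwaway userspace model of the old
delete-path rule (the function name and the PFN values are invented for
illustration; this is obviously not the kernel code itself):

    #include <stdio.h>
    #include <stddef.h>

    struct iova { unsigned long pfn_lo, pfn_hi; };

    /* Old rule: rather than let the 32-bit cache cross dma_32bit_pfn,
     * drop it, forcing the next allocation back to rb_last(). */
    static struct iova *old_delete_update(struct iova *cached32,
                                          struct iova *freed,
                                          struct iova *next, /* rb_next(freed) */
                                          unsigned long dma_32bit_pfn)
    {
            if (!cached32 || freed->pfn_lo < cached32->pfn_lo)
                    return cached32;        /* cache unaffected */
            /* only cache if it's below 32bit pfn */
            if (next && next->pfn_lo < dma_32bit_pfn)
                    return next;
            return NULL;                    /* forces rb_last() next time */
    }

    int main(void)
    {
            struct iova top32 = { 0xffffe, 0xffffe };   /* topmost 32-bit IOVA */
            struct iova above = { 0x100000, 0x100000 }; /* lowest >32-bit IOVA */
            struct iova *cached32 = &top32;

            cached32 = old_delete_update(cached32, &top32, &above, 0xfffff);
            printf("cached32_node %s\n", cached32 ? "kept" : "dropped -> rb_last()");
            return 0;
    }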

...and having worked through that, I now realise I failed to take it
into account in 62280cf2e8bb, so yes, the theoretical case of a 32-bit
allocation followed by a <32-bit allocation has in fact been broken (in
that it could adjust limit_pfn upwards and return a too-high IOVA for
the second call) for a while. Grrr...
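
The upward adjustment is easy to model in userspace (made-up PFN values
again, assuming a shift such that DMA_BIT_MASK(32) >> shift == 0xfffff;
old_update()/new_update() are just illustrative stand-ins for the
limit_pfn update in __get_cached_rbnode()):

    #include <stdio.h>

    /* cached_pfn_lo stands in for curr_iova->pfn_lo of the cached node */
    static unsigned long old_update(unsigned long limit_pfn,
                                    unsigned long cached_pfn_lo)
    {
            (void)limit_pfn;        /* ignored - that's the bug */
            return cached_pfn_lo;   /* unconditional: can move limit_pfn UP */
    }

    static unsigned long new_update(unsigned long limit_pfn,
                                    unsigned long cached_pfn_lo)
    {
            /* the min() from the patch: only ever clamp downwards */
            return limit_pfn < cached_pfn_lo ? limit_pfn : cached_pfn_lo;
    }

    int main(void)
    {
            unsigned long cached_pfn_lo = 0xfffff; /* last 32-bit allocation */
            unsigned long limit_pfn = 0x0ffff;     /* caller asks for a 16-bit mask */

            printf("old: limit_pfn -> %#lx (above what the caller asked for)\n",
                   old_update(limit_pfn, cached_pfn_lo));
            printf("new: limit_pfn -> %#lx\n",
                   new_update(limit_pfn, cached_pfn_lo));
            return 0;
    }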

With the new behaviour from this patch, freeing the topmost 32-bit IOVA
can let cached32_node point at the lowest >32-bit IOVA to avoid the
rb_last() overhead on the next allocation - that's how the stale pointer
bug could now happen (which is fixed by always checking both nodes in
__cached_rbnode_delete_update() below). Because cached32_node may now
occasionally point at something for which pfn_lo > dma_32bit_pfn, we add
the min() here to make sure we never move limit_pfn upwards (and thus
inadvertently fix the <32-bit case as well).
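
Again as a userspace sketch (invented PFNs, and struct dom is just a
stand-in for struct iova_domain), the v5 delete-path update that makes
this safe behaves like so:

    #include <stdio.h>
    #include <stddef.h>

    struct iova { unsigned long pfn_lo, pfn_hi; };

    struct dom {
            struct iova *cached_node;       /* whole-space cache */
            struct iova *cached32_node;     /* 32-bit cache */
            unsigned long dma_32bit_pfn;
    };

    /* Both caches are checked on every free, so neither can be left
     * dangling at a freed node; "next" stands in for rb_next(&freed->node). */
    static void delete_update(struct dom *d, struct iova *freed, struct iova *next)
    {
            if (freed->pfn_hi < d->dma_32bit_pfn &&
                d->cached32_node && freed->pfn_lo >= d->cached32_node->pfn_lo)
                    d->cached32_node = next;        /* may now be a >32-bit node */

            if (d->cached_node && freed->pfn_lo >= d->cached_node->pfn_lo)
                    d->cached_node = next;
    }

    int main(void)
    {
            struct iova top32 = { 0xffffe, 0xffffe };   /* topmost 32-bit IOVA */
            struct iova above = { 0x100000, 0x100000 }; /* lowest >32-bit IOVA */
            struct dom d = { &above, &top32, 0xfffff };

            delete_update(&d, &top32, &above);  /* free the topmost 32-bit IOVA */
            printf("cached32_node now at pfn %#lx (above dma_32bit_pfn)\n",
                   d.cached32_node->pfn_lo);
            return 0;
    }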

I admit that one of my motivations for rewriting so much here is just
because the existing code is so horrendously subtle and tricky. I'd like
to think that by patch #6 it's actually understandable without spending
several days picking through it...

Robin.

> Also, with your patch, two identical alloc_iova_fast() calls repeated
> in a loop may fill up the 32-bit space much faster. Please correct me
> if I have messed things up.
>
> Thanks,
> Tomasz
>
>
>> +
>> +
>> +    return rb_prev(cached_node);
>>  }
>>
>>  static void
>> -__cached_rbnode_insert_update(struct iova_domain *iovad,
>> -    unsigned long limit_pfn, struct iova *new)
>> +__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
>>  {
>> -    if (limit_pfn != iovad->dma_32bit_pfn)
>> -        return;
>> -    iovad->cached32_node = &new->node;
>> +    if (new->pfn_hi < iovad->dma_32bit_pfn)
>> +        iovad->cached32_node = &new->node;
>> +    else
>> +        iovad->cached_node = &new->node;
>>  }
>>
>>  static void
>>  __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
>>  {
>>      struct iova *cached_iova;
>> -    struct rb_node *curr;
>>
>> -    if (!iovad->cached32_node)
>> -        return;
>> -    curr = iovad->cached32_node;
>> -    cached_iova = rb_entry(curr, struct iova, node);
>> +    cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
>> +    if (free->pfn_hi < iovad->dma_32bit_pfn &&
>> +        iovad->cached32_node && free->pfn_lo >= cached_iova->pfn_lo)
>> +        iovad->cached32_node = rb_next(&free->node);
>>
>> -    if (free->pfn_lo >= cached_iova->pfn_lo) {
>> -        struct rb_node *node = rb_next(&free->node);
>> -        struct iova *iova = rb_entry(node, struct iova, node);
>> -
>> -        /* only cache if it's below 32bit pfn */
>> -        if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
>> -            iovad->cached32_node = node;
>> -        else
>> -            iovad->cached32_node = NULL;
>> -    }
>> +    cached_iova = rb_entry(iovad->cached_node, struct iova, node);
>> +    if (iovad->cached_node && free->pfn_lo >= cached_iova->pfn_lo)
>> +        iovad->cached_node = rb_next(&free->node);
>>  }
>>
>>  /* Insert the iova into domain rbtree by holding writer lock */
>> @@ -188,7 +185,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>  {
>>      struct rb_node *prev, *curr = NULL;
>>      unsigned long flags;
>> -    unsigned long saved_pfn, new_pfn;
>> +    unsigned long new_pfn;
>>      unsigned long align_mask = ~0UL;
>>
>>      if (size_aligned)
>> @@ -196,7 +193,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>
>>      /* Walk the tree backwards */
>>      spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
>> -    saved_pfn = limit_pfn;
>>      curr = __get_cached_rbnode(iovad, &limit_pfn);
>>      prev = curr;
>>      while (curr) {
>> @@ -226,7 +222,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>
>>      /* If we have 'prev', it's a valid place to start the insertion. */
>>      iova_insert_rbtree(&iovad->rbroot, new, prev);
>> -    __cached_rbnode_insert_update(iovad, saved_pfn, new);
>> +    __cached_rbnode_insert_update(iovad, new);
>>
>>      spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>
>> diff --git a/include/linux/iova.h b/include/linux/iova.h
>> index d179b9bf7814..69ea3e258ff2 100644
>> --- a/include/linux/iova.h
>> +++ b/include/linux/iova.h
>> @@ -70,7 +70,8 @@ struct iova_fq {
>>  struct iova_domain {
>>      spinlock_t      iova_rbtree_lock; /* Lock to protect update of rbtree */
>>      struct rb_root  rbroot;         /* iova domain rbtree root */
>> -    struct rb_node  *cached32_node; /* Save last alloced node */
>> +    struct rb_node  *cached_node;   /* Save last alloced node */
>> +    struct rb_node  *cached32_node; /* Save last 32-bit alloced node */
>>      unsigned long   granule;        /* pfn granularity for this domain */
>>      unsigned long   start_pfn;      /* Lower limit for this domain */
>>      unsigned long   dma_32bit_pfn;
>>
>