Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching

From: Leizhen (ThunderTown)
Date: Tue Sep 19 2017 - 05:47:33 EST

On 2017/7/31 19:42, Robin Murphy wrote:
> Hi Nate,
>
> On 29/07/17 04:57, Nate Watterson wrote:
>> Hi Robin,
>> I am seeing a crash when performing very basic testing on this series
>> with a Mellanox CX4 NIC. I dug into the crash a bit, and think this
>> patch is the culprit, but this rcache business is still mostly
>> witchcraft to me.
>>
>> # ifconfig eth5 up
>> # ifconfig eth5 down
>> Unable to handle kernel NULL pointer dereference at virtual address
>> 00000020
>> user pgtable: 64k pages, 48-bit VAs, pgd = ffff8007dbf47c00
>> [0000000000000020] *pgd=00000006efab0003, *pud=00000006efab0003,
>> *pmd=00000007d8720003, *pte=0000000000000000
>> Internal error: Oops: 96000007 [#1] SMP
>> Modules linked in:
>> CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3
>> task: ffff8007da1e5780 task.stack: ffff8007ddcb8000
>> PC is at __cached_rbnode_delete_update+0x2c/0x58
>> LR is at private_free_iova+0x2c/0x60
>> pc : [<ffff00000852c6e4>] lr : [<ffff00000852cdac>] pstate: 204001c5
>> sp : ffff8007ddcbba00
>> x29: ffff8007ddcbba00 x28: ffff8007c8350210
>> x27: ffff8007d1a80000 x26: ffff8007dcc20800
>> x25: 0000000000000140 x24: ffff8007c98f0008
>> x23: 00000000fffffe4e x22: 0000000000000140
>> x21: ffff8007c98f0008 x20: ffff8007c9adb240
>> x19: ffff8007c98f0018 x18: 0000000000000010
>> x17: 0000000000000000 x16: 0000000000000000
>> x15: 4000000000000000 x14: 0000000000000000
>> x13: 00000000ffffffff x12: 0000000000000001
>> x11: dead000000000200 x10: 00000000ffffffff
>> x9 : 0000000000000000 x8 : ffff8007c9adb1c0
>> x7 : 0000000040002000 x6 : 0000000000210d00
>> x5 : 0000000000000000 x4 : 000000000000c57e
>> x3 : 00000000ffffffcf x2 : 00000000ffffffcf
>> x1 : ffff8007c9adb240 x0 : 0000000000000000
>> [...]
>> [<ffff00000852c6e4>] __cached_rbnode_delete_update+0x2c/0x58
>> [<ffff00000852cdac>] private_free_iova+0x2c/0x60
>> [<ffff00000852cea4>] iova_magazine_free_pfns+0x4c/0xa0
>> [<ffff00000852d0a8>] free_iova_fast+0x1b0/0x230
>> [<ffff000008529d34>] iommu_dma_free_iova+0x5c/0x80
>> [<ffff000008529db4>] __iommu_dma_unmap+0x5c/0x98
>> [<ffff00000852aebc>] iommu_dma_unmap_resource+0x24/0x30
>> [<ffff00000852aed4>] iommu_dma_unmap_page+0xc/0x18
>> [<ffff000008093030>] __iommu_unmap_page+0x40/0x60
>> [<ffff0000087401bc>] mlx5e_page_release+0xbc/0x128
>> [<ffff000008740498>] mlx5e_dealloc_rx_wqe+0x30/0x40
>> [<ffff000008735ef0>] mlx5e_close_channel+0x70/0x1f8
>> [<ffff0000087370f4>] mlx5e_close_channels+0x2c/0x50
>> [<ffff0000087385dc>] mlx5e_close_locked+0x54/0x68
>> [<ffff000008738620>] mlx5e_close+0x30/0x58
>> [...]
>>
>> ** Disassembly for __cached_rbnode_delete_update() near the fault **
>> 92|if (free->pfn_hi < iovad->dma_32bit_pfn)
>> FFFF00000852C6C4| ldr x3,[x1,#0x18] ; x3,[free,#24]
>> FFFF00000852C6C8| ldr x2,[x0,#0x30] ; x2,[iovad,#48]
>> FFFF00000852C6CC| cmp x3,x2
>> FFFF00000852C6D0| b.cs 0xFFFF00000852C708
>> | curr = &iovad->cached32_node;
>> 94|if (!curr)
>> FFFF00000852C6D4| adds x19,x0,#0x18 ; x19,iovad,#24
>> FFFF00000852C6D8| b.eq 0xFFFF00000852C708
>> |
>> |cached_iova = rb_entry(*curr, struct iova, node);
>> |
>> 99|if (free->pfn_lo >= cached_iova->pfn_lo)
>> FFFF00000852C6DC| ldr x0,[x19] ; xiovad,[curr]
>> FFFF00000852C6E0| ldr x2,[x1,#0x20] ; x2,[free,#32]
>> FFFF00000852C6E4| ldr x0,[x0,#0x20] ; x0,[x0,#32]
>> Apparently cached_iova was NULL, so the pfn_lo access faulted.
>>
>> FFFF00000852C6E8| cmp x2,x0
>> FFFF00000852C6EC| b.cc 0xFFFF00000852C6FC
>> FFFF00000852C6F0| mov x0,x1 ; x0,free
>> 100| *curr = rb_next(&free->node);
>> After instrumenting the code a bit, this seems to be the culprit. In the
>> previous call, free->pfn_lo was 0xFFFF_FFFF, which is actually the
>> dma_limit for the domain, so rb_next() returned NULL.
>>
>> Let me know if you have any questions or would like additional tests
>> run. I also applied your "DMA domain debug info" patches and dumped the
>> contents of the domain at each of the steps above in case that would be
>> useful. If nothing else, they reinforce how thirsty the CX4 NIC is,
>> especially when using 64k pages and many CPUs.
>
> Thanks for the report - I somehow managed to reason myself out of
> keeping the "no cached node" check in __cached_rbnode_delete_update() on
> the assumption that it must always be set by a previous allocation.
> However, there is indeed just one case for which that fails: when
> you free any IOVA immediately after freeing the very topmost one. Which
> is something that freeing an entire magazine's worth of IOVAs back to
> the tree all at once has a very real chance of doing...
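
To make that sequence concrete, here is a rough sketch of the two frees
involved, based on the v2 code quoted further below (simplified, not the
actual source):

	/* free #1: 'free' is the topmost IOVA in the tree */
	__cached_rbnode_delete_update(iovad, free);
		/* free->pfn_lo >= cached_iova->pfn_lo, so: */
		*curr = rb_next(&free->node);	/* rb_next() of the last node is NULL */

	/* free #2: any IOVA on the same side of dma_32bit_pfn */
	__cached_rbnode_delete_update(iovad, free);
		cached_iova = rb_entry(*curr, struct iova, node);	/* *curr == NULL */
		if (free->pfn_lo >= cached_iova->pfn_lo)	/* reads NULL + 0x20: the oops above */
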
>
> The straightforward fix is inline below, but I'm now starting to
> understand the appeal of reserving a sentinel node to ensure the tree
> can never be empty, so I might have a quick go at that to see if it
> results in nicer code overall.
>
> Robin.
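
For reference, a rough sketch of how the sentinel idea might look (my
guess at the shape, not from any posted patch): keep a permanent
"anchor" entry pinned at the very top of the IOVA space, so the tree is
never empty and the caches always have a valid node to fall back to:

	/* hypothetical: a sentinel pinned above every real allocation */
	#define IOVA_ANCHOR	~0UL

	struct iova_domain {
		...
		struct iova	anchor;	/* never allocated from, never freed */
	};

	/* in init_iova_domain(): insert the anchor as the sole initial node */
	iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
	rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
	rb_insert_color(&iovad->anchor.node, &iovad->rbroot);

	/* the caches can then start at the anchor instead of NULL */
	iovad->cached_node = &iovad->anchor.node;
	iovad->cached32_node = &iovad->anchor.node;

With that, rb_next(&free->node) in __cached_rbnode_delete_update() can
never walk off the end of the tree for a real IOVA, so most of the NULL
checks would disappear.
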
>
>>
>> -Nate
>>
>> On 7/21/2017 7:42 AM, Robin Murphy wrote:
>>> The cached node mechanism provides a significant performance benefit for
>>> allocations using a 32-bit DMA mask, but in the case of non-PCI devices
>>> or where the 32-bit space is full, the loss of this benefit can be
>>> significant - on large systems there can be many thousands of entries in
>>> the tree, such that traversing to the end then walking all the way down
>>> to find free space every time becomes increasingly awful.
>>>
>>> Maintain a similar cached node for the whole IOVA space as a superset of
>>> the 32-bit space so that performance can remain much more consistent.
>>>
>>> Inspired by work by Zhen Lei <thunder.leizhen@xxxxxxxxxx>.
>>>
>>> Tested-by: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>
>>> Tested-by: Zhen Lei <thunder.leizhen@xxxxxxxxxx>
>>> Signed-off-by: Robin Murphy <robin.murphy@xxxxxxx>
>>> ---
>>>
>>> v2: No change
>>>
>>> drivers/iommu/iova.c | 59 +++++++++++++++++++++++++---------------------------
>>> include/linux/iova.h | 3 ++-
>>> 2 files changed, 30 insertions(+), 32 deletions(-)
>>>
>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>> index d094d1ca8f23..f5809a2ee6c2 100644
>>> --- a/drivers/iommu/iova.c
>>> +++ b/drivers/iommu/iova.c
>>> @@ -46,6 +46,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>>> spin_lock_init(&iovad->iova_rbtree_lock);
>>> iovad->rbroot = RB_ROOT;
>>> + iovad->cached_node = NULL;
>>> iovad->cached32_node = NULL;
>>> iovad->granule = granule;
>>> iovad->start_pfn = start_pfn;
>>> @@ -57,48 +58,46 @@ EXPORT_SYMBOL_GPL(init_iova_domain);
>>> static struct rb_node *
>>> __get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
>>> {
>>> - if ((*limit_pfn > iovad->dma_32bit_pfn) ||
>>> - (iovad->cached32_node == NULL))
>>> + struct rb_node *cached_node = NULL;
>>> + struct iova *curr_iova;
>>> +
>>> + if (*limit_pfn <= iovad->dma_32bit_pfn)
>>> + cached_node = iovad->cached32_node;
>>> + if (!cached_node)
>>> + cached_node = iovad->cached_node;
>>> + if (!cached_node)
>>> return rb_last(&iovad->rbroot);
>>> - else {
>>> - struct rb_node *prev_node = rb_prev(iovad->cached32_node);
>>> - struct iova *curr_iova =
>>> - rb_entry(iovad->cached32_node, struct iova, node);
>>> - *limit_pfn = curr_iova->pfn_lo;
>>> - return prev_node;
>>> - }
>>> +
>>> + curr_iova = rb_entry(cached_node, struct iova, node);
>>> + *limit_pfn = curr_iova->pfn_lo;
>>> +
>>> + return rb_prev(cached_node);
>>> }
>>> static void
>>> -__cached_rbnode_insert_update(struct iova_domain *iovad,
>>> - unsigned long limit_pfn, struct iova *new)
>>> +__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
>>> {
>>> - if (limit_pfn != iovad->dma_32bit_pfn)
>>> - return;
>>> - iovad->cached32_node = &new->node;
>>> + if (new->pfn_lo > iovad->dma_32bit_pfn)
>>> + iovad->cached_node = &new->node;
>>> + else
>>> + iovad->cached32_node = &new->node;
>>> }
>>> static void
>>> __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
>>> {
>>> struct iova *cached_iova;
>>> - struct rb_node *curr;
>>> + struct rb_node **curr = NULL;
>>> - if (!iovad->cached32_node)
>>> - return;
>>> - curr = iovad->cached32_node;
>>> - cached_iova = rb_entry(curr, struct iova, node);
>>> + if (free->pfn_hi < iovad->dma_32bit_pfn)
>>> + curr = &iovad->cached32_node;
>>> + if (!curr)
>>> + curr = &iovad->cached_node;
>
> + if (!*curr)
> + return;
Would it make sense to try the following adjustment instead?
+ if (free->pfn_hi < iovad->dma_32bit_pfn)
+ curr = &iovad->cached32_node;
+ else
+ curr = &iovad->cached_node;
+
+ if (!*curr) {
+ *curr = rb_next(&free->node);
+ return;
+ }
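
That is, the whole function would then read roughly like this (pieced
together from the quoted diff, untested):

	static void
	__cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
	{
		struct iova *cached_iova;
		struct rb_node **curr;

		/* pick the cache for the side of dma_32bit_pfn being freed */
		if (free->pfn_hi < iovad->dma_32bit_pfn)
			curr = &iovad->cached32_node;
		else
			curr = &iovad->cached_node;

		/* empty cache: repopulate it from the freed node's successor */
		if (!*curr) {
			*curr = rb_next(&free->node);
			return;
		}

		cached_iova = rb_entry(*curr, struct iova, node);
		if (free->pfn_lo >= cached_iova->pfn_lo)
			*curr = rb_next(&free->node);
	}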


>
>>> - if (free->pfn_lo >= cached_iova->pfn_lo) {
>>> - struct rb_node *node = rb_next(&free->node);
>>> - struct iova *iova = rb_entry(node, struct iova, node);
>>> + cached_iova = rb_entry(*curr, struct iova, node);
>>> - /* only cache if it's below 32bit pfn */
>>> - if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
>>> - iovad->cached32_node = node;
>>> - else
>>> - iovad->cached32_node = NULL;
>>> - }
>>> + if (free->pfn_lo >= cached_iova->pfn_lo)
>>> + *curr = rb_next(&free->node);
>>> }
>>> /* Insert the iova into domain rbtree by holding writer lock */
>>> @@ -135,7 +134,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>> {
>>> struct rb_node *prev, *curr = NULL;
>>> unsigned long flags;
>>> - unsigned long saved_pfn;
>>> unsigned long align_mask = ~0UL;
>>> if (size_aligned)
>>> @@ -143,7 +141,6 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>> /* Walk the tree backwards */
>>> spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
>>> - saved_pfn = limit_pfn;
>>> curr = __get_cached_rbnode(iovad, &limit_pfn);
>>> prev = curr;
>>> while (curr) {
>>> @@ -173,7 +170,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>>> /* If we have 'prev', it's a valid place to start the insertion. */
>>> iova_insert_rbtree(&iovad->rbroot, new, prev);
>>> - __cached_rbnode_insert_update(iovad, saved_pfn, new);
>>> + __cached_rbnode_insert_update(iovad, new);
>>> spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>> diff --git a/include/linux/iova.h b/include/linux/iova.h
>>> index e0a892ae45c0..0bb8df43b393 100644
>>> --- a/include/linux/iova.h
>>> +++ b/include/linux/iova.h
>>> @@ -40,7 +40,8 @@ struct iova_rcache {
>>> struct iova_domain {
>>> spinlock_t iova_rbtree_lock; /* Lock to protect update of rbtree */
>>> struct rb_root rbroot; /* iova domain rbtree root */
>>> - struct rb_node *cached32_node; /* Save last alloced node */
>>> + struct rb_node *cached_node; /* Save last alloced node */
>>> + struct rb_node *cached32_node; /* Save last 32-bit alloced node */
>>> unsigned long granule; /* pfn granularity for this domain */
>>> unsigned long start_pfn; /* Lower limit for this domain */
>>> unsigned long dma_32bit_pfn;
>>>
>>
>
>
>
>

--
Thanks!
Best Regards