Re: [PATCH] iommu/iova: Update cached node pointer when current node fails to get any free IOVA

From: Ganapatrao Kulkarni
Date: Sun May 20 2018 - 20:34:29 EST


On Thu, Apr 26, 2018 at 3:15 PM, Ganapatrao Kulkarni <gklkml16@xxxxxxxxx> wrote:
> Hi Robin,
>
> On Mon, Apr 23, 2018 at 11:11 PM, Ganapatrao Kulkarni
> <gklkml16@xxxxxxxxx> wrote:
>> On Mon, Apr 23, 2018 at 10:07 PM, Robin Murphy <robin.murphy@xxxxxxx> wrote:
>>> On 19/04/18 18:12, Ganapatrao Kulkarni wrote:
>>>>
>>>> The performance drop is observed with long hours iperf testing using 40G
>>>> cards. This is mainly due to long iterations in finding the free iova
>>>> range in 32bit address space.
>>>>
>>>> In current implementation for 64bit PCI devices, there is always first
>>>> attempt to allocate iova from 32bit(SAC preferred over DAC) address
>>>> range. Once we run out 32bit range, there is allocation from higher range,
>>>> however due to cached32_node optimization it does not suppose to be
>>>> painful. cached32_node always points to recently allocated 32-bit node.
>>>> When address range is full, it will be pointing to last allocated node
>>>> (leaf node), so walking rbtree to find the available range is not
>>>> expensive affair. However this optimization does not behave well when
>>>> one of the middle node is freed. In that case cached32_node is updated
>>>> to point to next iova range. The next iova allocation will consume free
>>>> range and again update cached32_node to itself. From now on, walking
>>>> over 32-bit range is more expensive.
>>>>
>>>> This patch adds fix to update cached node to leaf node when there are no
>>>> iova free range left, which avoids unnecessary long iterations.
>>>
>>>
>>> The only trouble with this is that "allocation failed" doesn't uniquely mean
>>> "space full". Say that after some time the 32-bit space ends up empty except
>>> for one page at 0x1000 and one at 0x80000000, then somebody tries to
>>> allocate 2GB. If we move the cached node down to the leftmost entry when
>>> that fails, all subsequent allocation attempts are now going to fail despite
>>> the space being 99.9999% free!
>>>
>>> I can see a couple of ways to solve that general problem of free space above
>>> the cached node getting lost, but neither of them helps with the case where
>>> there is genuinely insufficient space (and if anything would make it even
>>> slower). In terms of the optimisation you want here, i.e. fail fast when an
>>> allocation cannot possibly succeed, the only reliable idea which comes to
>>> mind is free-PFN accounting. I might give that a go myself to see how ugly
>>> it looks.
>
> For this testing, dual port intel 40G card(XL710) used and both ports
> were connected in loop-back. Ran iperf server and clients on both
> ports(used NAT to route packets out on intended ports).There were 10
> iperf clients invoked every 60 seconds in loop for hours for each
> port. Initially the performance on both ports is seen close to line
> rate, however after test ran about 4 to 6 hours, the performance
> started dropping to very low (to few hundred Mbps) on both
> connections.
>
> IMO, this is common bug and should happen on any other platforms too
> and needs to be fixed at the earliest.
> Please let me know if you have better way to fix this, i am happy to
> test your patch!

any update on this issue?
>
>>
>> i see 2 problems in current implementation,
>> 1. We don't replenish the 32 bits range, until first attempt of second
>> allocation(64 bit) fails.
>> 2. Having per cpu cache might not yield good hit on platforms with
>> more number of CPUs.
>>
>> however irrespective of current issues, It makes sense to update
>> cached node as done in this patch , when there is failure to get iova
>> range using current cached pointer which is forcing for the
>> unnecessary time consuming do-while iterations until any replenish
>> happens!
>>
>> thanks
>> Ganapat
>>
>>>
>>> Robin.
>>>
>>>
>>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@xxxxxxxxxx>
>>>> ---
>>>> drivers/iommu/iova.c | 6 ++++++
>>>> 1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>>>> index 83fe262..e6ee2ea 100644
>>>> --- a/drivers/iommu/iova.c
>>>> +++ b/drivers/iommu/iova.c
>>>> @@ -201,6 +201,12 @@ static int __alloc_and_insert_iova_range(struct
>>>> iova_domain *iovad,
>>>> } while (curr && new_pfn <= curr_iova->pfn_hi);
>>>> if (limit_pfn < size || new_pfn < iovad->start_pfn) {
>>>> + /* No more cached node points to free hole, update to leaf
>>>> node.
>>>> + */
>>>> + struct iova *prev_iova;
>>>> +
>>>> + prev_iova = rb_entry(prev, struct iova, node);
>>>> + __cached_rbnode_insert_update(iovad, prev_iova);
>>>> spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
>>>> return -ENOMEM;
>>>> }
>>>>
>>>
>
> thanks
> Ganapat