Re: [PATCH] iommu/iova: Better utilizing cpu_rcaches in no-strict mode

From: Robin Murphy
Date: Mon Jun 24 2024 - 09:36:09 EST


On 2024-06-24 2:13 pm, zhangzekun (A) wrote:

On 2024/6/24 18:56, Robin Murphy wrote:
On 2024-06-24 9:39 am, Zhang Zekun wrote:
Currently, when the iommu is working in no-strict mode, fq_flush_timeout()
will always try to free iovas on one cpu. Freeing the iovas from all
cpus on one cpu is not cache-friendly to the iova_rcache, because it will
first fill up that cpu's cpu_rcache and then push iovas to the depot, and
iovas in the depot will finally go to the underlying rbtree once the
depot_size is greater than num_online_cpus().
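
To make that flow concrete, here is a toy userspace model of the hierarchy described above: per-CPU cache, then depot, then the rbtree slow path. It is not the kernel code; MAG_SIZE, DEPOT_MAX and all helper names are illustrative only.

/* Toy model of the free path: per-cpu magazine -> depot -> rbtree.
 * Build: cc -std=c11 -Wall rcache_model.c
 */
#include <stdio.h>
#include <stdlib.h>

#define MAG_SIZE  128   /* iovas per magazine (illustrative value) */
#define DEPOT_MAX   8   /* pretend num_online_cpus() == 8 */

struct magazine {
	unsigned long pfns[MAG_SIZE];
	size_t size;
};

static struct magazine *cpu_mag;            /* the flushing cpu's cache */
static struct magazine *depot[DEPOT_MAX];   /* shared depot of full magazines */
static size_t depot_size;
static unsigned long rbtree_frees;          /* slow-path hits */

static void rbtree_free(unsigned long pfn)
{
	(void)pfn;          /* real code: take iova_rbtree_lock, erase the node */
	rbtree_frees++;
}

static void iova_free(unsigned long pfn)
{
	if (cpu_mag->size < MAG_SIZE) {
		cpu_mag->pfns[cpu_mag->size++] = pfn;   /* fast per-cpu path */
		return;
	}
	if (depot_size < DEPOT_MAX) {
		depot[depot_size++] = cpu_mag;          /* park the full magazine */
		cpu_mag = calloc(1, sizeof(*cpu_mag));
		cpu_mag->pfns[cpu_mag->size++] = pfn;
		return;
	}
	rbtree_free(pfn);                               /* caches exhausted */
}

int main(void)
{
	cpu_mag = calloc(1, sizeof(*cpu_mag));

	/* Free far more iovas on one cpu than its cache plus the depot can
	 * hold, mimicking fq_flush_timeout() draining every cpu's queue locally.
	 */
	for (unsigned long pfn = 1; pfn <= 10000; pfn++)
		iova_free(pfn);

	printf("slow-path (rbtree) frees: %lu of 10000\n", rbtree_frees);
	return 0;
}

In this toy setup only the first 1152 of the 10000 frees are absorbed by the caches (one loaded magazine plus eight parked ones); the remaining 8848 fall through to rbtree_free().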

That is the design intent - if the excess magazines sit in the depot long enough to be reclaimed then other CPUs didn't want them either. We're trying to minimise the amount of unused cached IOVAs sitting around wasting memory, since IOVA memory consumption has proven to be quite significant on large systems.

Hi, Robin,

It does cost some memory after this change, but since we have been freeing iovas on each cpu in strict mode for years, I think this change is reasonable in that it makes the iova free logic identical to strict mode.
This patch tries to keep the speed of allocating and freeing iovas roughly the same by better utilizing the iova_rcache; otherwise we are more likely to enter the slow path. The memory saving actually comes at the cost of performance, and I am not sure we need such an optimization for no-strict mode, which is mainly used for performance reasons.


As alluded to in the original cover letter, 100ms for IOVA_DEPOT_DELAY was just my arbitrary value of "long enough" to keep the initial implementation straightforward - I do expect that certain workloads might benefit from tuning it differently, but without proof of what they are and what they want, there's little justification for introducing extra complexity and potential user ABI yet.
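
For reference, a rough C fragment of the reclaim mechanism being discussed: deferred work runs IOVA_DEPOT_DELAY after a magazine is parked and gives back one magazine that no CPU has claimed in the meantime, re-arming itself while the depot stays populated. This is a paraphrase, not the actual iova.c code; depot_pop(), release_to_rbtree() and schedule_reclaim() are placeholders.

#define IOVA_DEPOT_DELAY_MS 100          /* the arbitrary "long enough" value */

struct magazine;

static struct magazine *depot_pop(void);              /* take one parked magazine, or NULL */
static void release_to_rbtree(struct magazine *mag);  /* free its iovas via the slow path */
static void schedule_reclaim(unsigned int delay_ms);  /* re-arm the deferred work */

/* A magazine still sitting in the depot after the delay means no cpu's
 * allocation path wanted those cached iovas in the meantime, so stop
 * holding on to their memory. */
static void depot_reclaim_work(void)
{
	struct magazine *mag = depot_pop();

	if (!mag)
		return;
	release_to_rbtree(mag);
	schedule_reclaim(IOVA_DEPOT_DELAY_MS);   /* keep trimming while non-empty */
}

A longer delay keeps more iovas cached in the hope that a later allocation burst reuses the parked magazines; a shorter one returns the memory sooner - exactly the workload-dependent trade-off described above.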

Letting fq_flush_timeout() free iovas on the cpus that call the dma_unmap*
APIs can decrease the overall time spent in fq_flush_timeout() by better
utilizing the iova_rcache and minimizing contention on the
iova_rbtree_lock.
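
This is not the posted patch, but to make the idea concrete, one hypothetical shape of it could look like the fragment below: instead of one cpu draining every flush queue, the timeout kicks a per-cpu work item so each cpu frees its own queued iovas into its own rcache. fq_free_work is an invented field; the other names follow drivers/iommu/dma-iommu.c.

static void fq_flush_timeout(struct timer_list *t)
{
	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
	int cpu;

	atomic_set(&cookie->fq_timer_on, 0);
	fq_flush_iotlb(cookie);

	for_each_online_cpu(cpu)
		/* hypothetical per-cpu work that calls fq_ring_free() locally */
		queue_work_on(cpu, system_wq,
			      per_cpu_ptr(cookie->fq_free_work, cpu));
}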

I would have assumed that a single CPU simply throwing magazines into the depot list from its own percpu cache is quicker, or at least no slower, than doing the same while causing additional contention/sharing by interfering with other percpu caches as well. And where does the rbtree come into that either way? If an observable performance issue actually exists here, I'd like a more detailed breakdown to understand it.

Thanks,
Robin.


This patch was originally intended to minimize the chance of the softlockup issue in fq_flush_timeout(), which was already described earlier in [1] and has been applied in a commercial kernel [2] for years.

However, later tests showed that this single patch is not enough to fix the softlockup issue, since the root cause of the softlockup is the underlying iova_rbtree_lock. In our softlockup scenarios, the average
time to acquire this spinlock is about 6ms.

That should already be fixed, though. The only reason for fq_flush_timeout() to interact with the rbtree at all was due to the notion of a fixed-size depot which could become full. That no longer exists since 911aa1245da8 ("iommu/iova: Make the rcache depot scale better").

Thanks,
Robin.