Re: [PATCH] iommu/iova: Better utilizing cpu_rcaches in no-strict mode
From: zhangzekun (A)
Date: Mon Jun 24 2024 - 09:13:23 EST
On 2024/6/24 18:56, Robin Murphy wrote:
On 2024-06-24 9:39 am, Zhang Zekun wrote:
Currently, when the iommu is working in no-strict mode, fq_flush_timeout()
will always try to free iovas on one cpu. Freeing the iovas from all
cpus on a single cpu is not cache-friendly to the iova_rcache, because
it will first fill up that cpu's cpu_rcache and then push iovas to the
depot; iovas in the depot will finally go to the underlying rbtree once
the depot_size grows greater than num_online_cpus().
That is the design intent - if the excess magazines sit in the depot
long enough to be reclaimed then other CPUs didn't want them either.
We're trying to minimise the amount of unused cached IOVAs sitting
around wasting memory, since IOVA memory consumption has proven to be
quite significant on large systems.
Hi, Robin,
It does waste some memory after this change, but since we have been
freeing iovas on each cpu in strict mode for years, I think this change
is reasonable in that it makes the iova free logic identical to strict
mode. This patch tries to make the speed of allocating and freeing
iovas roughly the same by better utilizing the iova_rcache; otherwise
we are more likely to enter the slowpath. The saving in memory
consumption actually comes at the cost of performance, and I am not
sure we need such an optimization for no-strict mode, which is mainly
used for performance reasons.
As alluded to in the original cover letter, 100ms for IOVA_DEPOT_DELAY
was just my arbitrary value of "long enough" to keep the initial
implementation straightforward - I do expect that certain workloads
might benefit from tuning it differently, but without proof of what they
are and what they want, there's little justification for introducing
extra complexity and potential user ABI yet.
Letting fq_flush_timeout()
free iovas on the cpus which called the dma_unmap* APIs can decrease the
overall time spent in fq_flush_timeout() by better utilizing the
iova_rcache and minimizing competition for the iova_rbtree_lock.
I would have assumed that a single CPU simply throwing magazines into
the depot list from its own percpu cache is quicker, or at least no
slower, than doing the same while causing additional contention/sharing
by interfering with other percpu caches as well. And where does the
rbtree come into that either way? If an observable performance issue
actually exists here, I'd like a more detailed breakdown to understand it.
Thanks,
Robin.
This patch was primarily intended to minimize the chance of the
softlockup issue in fq_flush_timeout(), which was already described
earlier in [1] and has been applied in a commercial kernel [2] for
years.
However, later tests show that this single patch is not enough to
fix the softlockup issue, since the root cause of the softlockup is the
underlying iova_rbtree_lock. In our softlockup scenarios, the average
time spent acquiring this spinlock is about 6ms.
[1] https://lkml.org/lkml/2023/2/16/124
[2] https://gitee.com/openeuler/kernel/tree/OLK-5.10/
Thanks,
Zekun