Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"

From: John Garry
Date: Mon Mar 08 2021 - 11:25:14 EST


On 08/03/2021 15:15, Robin Murphy wrote:
>> I figure that you're talking about 4e89dce72521 now. I would have
>> liked to know which real-life problem it solved in practice.
>
> From what I remember, the problem reported was basically the one
> illustrated in that commit and the one I alluded to above - namely that
> certain allocation patterns with a broad mix of sizes and relative
> lifetimes end up pushing the cached PFN down to the bottom of the
> address space such that allocations start failing despite there still
> being sufficient free space overall, which was breaking some media
> workload. What was originally proposed was an overcomplicated palaver
> with DMA attributes and a whole extra allocation algorithm rather than
> just fixing the clearly unintended and broken behaviour.

OK, fine. I just wondered whether this was only a theoretical problem.


> While max32_alloc_size indirectly tracks the largest *contiguous*
> available space, one of the ideas from which it grew was to simply keep
> count of the total number of free PFNs. If you're really spending
> significant time determining that the tree is full, as opposed to just
> taking longer to eventually succeed, then it might be relatively
> innocuous to tack on that semi-redundant extra accounting as a
> self-contained quick fix for that worst case.
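
For reference, a minimal user-space model of that extra accounting (the
free_pfns counter and all names here are made up for illustration - this
is not actual iova.c code):

#include <stdbool.h>
#include <stdio.h>

/* Toy model: only the total-free accounting is shown; the rbtree
 * search for a contiguous range is elided. free_pfns is hypothetical. */
struct toy_iova_domain {
	unsigned long free_pfns;	/* total unallocated PFNs */
};

static bool toy_alloc(struct toy_iova_domain *d, unsigned long size)
{
	/* Fail fast: if the whole domain has fewer free PFNs than
	 * requested, no tree walk can possibly succeed. */
	if (size > d->free_pfns)
		return false;
	/* ...contiguous-range search would go here... */
	d->free_pfns -= size;
	return true;
}

static void toy_free(struct toy_iova_domain *d, unsigned long size)
{
	d->free_pfns += size;
}

int main(void)
{
	struct toy_iova_domain d = { .free_pfns = 1024 };

	printf("%d\n", toy_alloc(&d, 512));	/* 1: fits */
	printf("%d\n", toy_alloc(&d, 1024));	/* 0: fails fast */
	toy_free(&d, 512);
	printf("%d\n", toy_alloc(&d, 1024));	/* 1: fits again */
	return 0;
}

Of course the counter can only prove that a request is impossible;
passing the check still says nothing about fragmentation, which is what
makes it semi-redundant with max32_alloc_size rather than a replacement
for it.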


...


>> Even if it were configurable, wouldn't it make sense to have it
>> configurable per IOVA domain?
>
> Perhaps, but I don't see that being at all easy to implement. We can't
> arbitrarily *increase* the scope of caching once a domain is active due
> to the size-rounding-up requirement, which would be prohibitive to
> larger allocations if applied universally.


Agreed.
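
(For context, the size-rounding-up in question is the power-of-two
rounding that dma-iommu applies to anything small enough to be cached -
approximately this, from iommu_dma_alloc_iova() of that era:)

	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
		iova_len = roundup_pow_of_two(iova_len);

Apply that rounding universally and, say, a 1025-page mapping would
consume 2048 pages of IOVA space, so raising the cacheable ceiling for
everyone would waste nearly half the space on unlucky sizes.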

But having that (all IOVA sizes being cacheable) available could be really great for some situations, though.

Furthermore, as mentioned above, I still want to solve this IOVA aging issue, and this fixed RCACHE RANGE size seems to be at the centre of that problem.
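
(That is the compile-time ceiling in iova.c - any allocation whose order
is at or above it simply bypasses the caches; again quoted approximately:)

#define IOVA_RANGE_CACHE_MAX_SIZE 6	/* log of max cached IOVA range size (in pages) */

static unsigned long iova_rcache_get(struct iova_domain *iovad,
				     unsigned long size,
				     unsigned long limit_pfn)
{
	unsigned int log_size = order_base_2(size);

	if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
		return 0;

	return __iova_rcache_get(&iovad->rcaches[log_size], limit_pfn - size);
}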


>> As for 4e89dce72521, even if it's proper to retry for a failed alloc,
>> it is not always necessary. I mean, if we're limiting ourselves to the
>> 32b subspace for this SAC trick and we fail the alloc, then we can try
>> the space above 32b first (if usable), and if that also fails, retry
>> there (above 32b). I don't see a need to retry the 32b subspace if
>> we're not limited to it. How about it? We tried that idea and it looks
>> to just about restore performance.
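
To sketch where that lands in iommu_dma_alloc_iova() - hypothetically
assuming the existing flush_rcache parameter were extended to also skip
the retry-from-last-node pass that 4e89dce72521 added (it does not gate
that today):

	/* Cheap SAC-trick attempt: a failure here can fall back to the
	 * space above 32 bits, so skip the expensive flush-and-retry. */
	if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
		iova = alloc_iova_fast(iovad, iova_len,
				       DMA_BIT_MASK(32) >> shift, false);

	/* Genuinely limited attempt: only here is the full rcache flush
	 * and rbtree retry worth paying for. */
	if (!iova)
		iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
				       true);

That way the 32b subspace is only re-walked when we really have nowhere
else to go.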
> The thing is, if you do have an actual PCI device where DAC might mean
> a 33% throughput loss and you're mapping a long-lived buffer, or you're
> on one of these systems where firmware fails to document address limits
> and using the full IOMMU address width quietly breaks things, then you
> almost certainly *do* want the allocator to actually do a proper job of
> trying to satisfy the given request.

If both of those conditions were true, then that seems quite a tenuous position to be in, so trying to help such a scenario in general terms will have limited efficacy.

> Still, I'd be curious to see if making the restart a bit cleverer
> offers a noticeable improvement. IIRC I suggested it at the time, but
> in the end the push was just to get *something* merged.

Sorry to say, I just tested that ("iommu/iova: Improve restart logic") and there is no obvious improvement.

I'll have a further look at what might be going on.

Thanks very much,
John