Re: [RFC PATCH] iommu/iova: Add a best-fit algorithm
From: Robin Murphy
Date: Thu Feb 20 2020 - 13:42:33 EST
On 20/02/2020 6:38 am, isaacm@xxxxxxxxxxxxxx wrote:
On 2020-02-17 08:03, Robin Murphy wrote:
On 14/02/2020 11:06 pm, Isaac J. Manjarres wrote:
From: Liam Mark <lmark@xxxxxxxxxxxxxx>
Using the best-fit algorithm, instead of the first-fit
algorithm, may reduce fragmentation when allocating
IOVAs.
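(For illustration only: a minimal user-space sketch, not the kernel allocator, contrasting the two placement policies over a hand-made list of free gaps. Every address and size below is invented for the example.)

/*
 * Toy comparison of first-fit vs best-fit placement over a list of free
 * gaps in an imaginary, fragmented IOVA space (sizes in 4K pages).
 */
#include <stdio.h>
#include <stddef.h>

struct gap { unsigned long start, size; };

/* First-fit: take the first gap large enough for the request. */
static struct gap *first_fit(struct gap *gaps, size_t n, unsigned long size)
{
        for (size_t i = 0; i < n; i++)
                if (gaps[i].size >= size)
                        return &gaps[i];
        return NULL;
}

/* Best-fit: take the smallest gap that still satisfies the request. */
static struct gap *best_fit(struct gap *gaps, size_t n, unsigned long size)
{
        struct gap *best = NULL;

        for (size_t i = 0; i < n; i++)
                if (gaps[i].size >= size &&
                    (!best || gaps[i].size < best->size))
                        best = &gaps[i];
        return best;
}

int main(void)
{
        struct gap gaps[] = {
                { 0x00000, 0x400 },     /* 4MB gap  */
                { 0x10000, 0x10 },      /* 64KB gap */
                { 0x20000, 0x4000 },    /* 64MB gap */
        };
        size_t n = sizeof(gaps) / sizeof(gaps[0]);
        unsigned long want = 0x10;      /* 64KB request */

        struct gap *ff = first_fit(gaps, n, want);
        struct gap *bf = best_fit(gaps, n, want);

        /*
         * First-fit picks the 4MB gap at 0x00000 and would split it;
         * best-fit picks the exact-fit 64KB gap at 0x10000, leaving the
         * large gaps intact for later, bigger requests.
         */
        if (ff && bf)
                printf("first-fit places at 0x%lx, best-fit at 0x%lx\n",
                       ff->start, bf->start);
        return 0;
}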
What kind of pathological allocation patterns make that a serious
problem? Is there any scope for simply changing the order of things in
the callers? Do these drivers also run under other DMA API backends
(e.g. 32-bit Arm)?
The use cases in which the IOVA space becomes fragmented have non-deterministic allocation patterns, so it is not feasible to change the allocation order to avoid fragmenting the IOVA space.
What about combining smaller buffers into larger individual allocations;
any scope for that sort of thing? Certainly if you're consistently
allocating small things less than PAGE_SIZE then DMA pools would be
useful to avoid wanton memory wastage in general.
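(For reference, Robin is presumably referring to the dmapool API; a rough sketch of how a hypothetical driver could use it for repeated sub-PAGE_SIZE buffers is below. The "my-descs" name, my_dev and the 256-byte descriptor size are illustrative only, not taken from the patch or the drivers under discussion.)

#include <linux/device.h>
#include <linux/dmapool.h>
#include <linux/errno.h>

static struct dma_pool *desc_pool;

static int my_driver_setup(struct device *my_dev)
{
        /* One pool of 256-byte, 256-byte-aligned descriptors. */
        desc_pool = dma_pool_create("my-descs", my_dev, 256, 256, 0);
        if (!desc_pool)
                return -ENOMEM;
        return 0;
}

static void *my_driver_get_desc(dma_addr_t *dma)
{
        /* Sub-allocates from pages the pool has already DMA-mapped. */
        return dma_pool_alloc(desc_pool, GFP_KERNEL, dma);
}

static void my_driver_put_desc(void *vaddr, dma_addr_t dma)
{
        dma_pool_free(desc_pool, vaddr, dma);
}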
From what we've observed, the use cases involve allocations of two types of buffers: one type between 1 KB and 4 MB in size, and another between 1 KB and 400 MB in size.
The pathological scenarios seem to arise when there are many (100+) randomly distributed non-power-of-two allocations, which in some cases leave behind holes of up to 100+ MB in the IOVA space.
Here are some examples showing the state of the IOVA space when an IOVA allocation failure was observed:
Instance 1:
    Currently mapped total size : ~1.3GB
    Free space available : ~2GB
    Map for ~162MB fails.
    Max contiguous space available : < 162MB
Instance 2:
    Currently mapped total size : ~950MB
    Free space available : ~2.3GB
    Map for ~320MB fails.
    Max contiguous space available : ~189MB
Instance 3:
    Currently mapped total size : ~1.2GB
    Free space available : ~2.7GB
    Map for ~162MB fails.
    Max contiguous space available : < 162MB
We are still collecting data with the best-fit algorithm enabled so that we can provide numbers showing that it results in less IOVA space fragmentation.
Thanks for those examples, and I'd definitely like to see the
comparative figures. To dig a bit further, at the point where things
start failing, where are the cached nodes pointing? IIRC there is still
a pathological condition where empty space between limit_pfn and
cached32_node gets 'lost' if nothing in between is freed, so the bigger
the range of allocation sizes, the worse the effect, e.g.:
(considering an empty domain, pfn 0 *not* reserved, 32-bit limit_pfn)
alloc 4K, succeeds, cached32_node now at 4G-4K
alloc 2G, succeeds, cached32_node now at 0
alloc 4K, fails despite almost 2G contiguous free space within limit_pfn
(and max32_alloc_size==1 now fast-forwards *any* further allocation
attempt to failure)
If you're falling foul of this case (I was never sure how realistic a
problem it would be in practice), there are at least a couple of much
less invasive tweaks I can think of that would be worth exploring.
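(To make that sequence concrete, here is a heavily simplified user-space model of the top-down, cached-node search. It is not the kernel code; the alignment and fast-fail handling only approximate what __alloc_and_insert_iova_range() does, but it reproduces the 'lost space' behaviour described above.)

#include <stdio.h>

#define LIMIT_PFN       0xfffffUL               /* 32-bit space, 4K pages */

static unsigned long cached = LIMIT_PFN + 1;    /* empty domain */
static unsigned long max32_alloc_size = LIMIT_PFN + 1; /* no failure recorded yet */

static long alloc_pfns(unsigned long size)
{
        unsigned long start;

        /* Fast-fail any request at least as big as one that already failed. */
        if (size >= max32_alloc_size)
                return -1;

        /* Only space below the cached node is ever considered. */
        if (cached < size) {
                max32_alloc_size = size;
                return -1;
        }

        /* Size-aligned placement below the cached node (power-of-two sizes only). */
        start = (cached - size) & ~(size - 1);
        cached = start;
        return start;
}

int main(void)
{
        printf("4K alloc -> pfn %ld\n", alloc_pfns(1));        /* succeeds at 4G-4K */
        printf("2G alloc -> pfn %ld\n", alloc_pfns(0x80000));  /* succeeds at 0 */
        printf("4K alloc -> pfn %ld\n", alloc_pfns(1));        /* fails: cached == 0,
                                                                   ~2G above is 'lost' */
        printf("4K alloc -> pfn %ld\n", alloc_pfns(1));        /* fast-fails via
                                                                   max32_alloc_size */
        return 0;
}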
To answer your question about whether these drivers also run under other DMA API backends: yes, they do, for example 32-bit ARM.
OK, that's what I suspected :)
AFAICS arch/arm's __alloc_iova() is also a first-fit algorithm, so if
you get better behaviour there then it would suggest that this aspect
isn't really the most important issue. Certainly, the fact that the
"best fit" logic here also happens to ignore the cached nodes does start
drawing me back to the point above.
Robin.