Re: [RFC PATCH] iommu/iova: Add a best-fit algorithm

From: Robin Murphy
Date: Thu Feb 20 2020 - 13:42:33 EST

On 20/02/2020 6:38 am, isaacm@xxxxxxxxxxxxxx wrote:
On 2020-02-17 08:03, Robin Murphy wrote:
On 14/02/2020 11:06 pm, Isaac J. Manjarres wrote:
From: Liam Mark <lmark@xxxxxxxxxxxxxx>

Using the best-fit algorithm, instead of the first-fit
algorithm, may reduce fragmentation when allocating

What kind of pathological allocation patterns make that a serious
problem? Is there any scope for simply changing the order of things in
the callers? Do these drivers also run under other DMA API backends
(e.g. 32-bit Arm)?

The usecases where the IOVA space has been fragmented have non-deterministic allocation
patterns, and thus, it's not feasible to change the allocation order to avoid fragmenting
the IOVA space.

What about combining smaller buffers into larger individual allocations; any scope for that sort of thing? Certainly if you're consistently allocating small things less than PAGE_SIZE then DMA pools would be useful to avoid wanton memory wastage in general.

From what we've observed, the usecases involve allocations of two types of
buffers: one type of buffer between 1 KB to 4 MB in size, and another type of
buffer between 1 KB to 400 MB in size.

The pathological scenarios seem to arise when there are
many (100+) randomly distributed non-power of two allocations, which in some cases leaves
behind holes of up to 100+ MB in the IOVA space.

Here are some examples that show the state of the IOVA space under which failure to
allocate an IOVA was observed:

Instance 1:
ÂÂÂÂCurrently mapped total size : ~1.3GB
ÂÂÂÂFree space available : ~2GB
ÂÂÂÂMap for ~162MB fails.
ÂÂÂÂÂÂÂ Max contiguous space available : < 162MB

Instance 2:
ÂÂÂÂCurrently mapped total size : ~950MB
ÂÂÂÂFree space available : ~2.3GB
ÂÂÂÂMap for ~320MB fails.
ÂÂÂÂMax contiguous space available : ~189MB

Instance 3:
ÂÂÂÂCurrently mapped total size : ~1.2GB
ÂÂÂÂFree space available : ~2.7GB
ÂÂÂÂMap for ~162MB fails.
ÂÂÂÂMax contiguous space available : <162MB

We are still in the process of collecting data with the best-fit algorithm enabled
to provide some numbers to show that it results in less IOVA space

Thanks for those examples, and I'd definitely like to see the comparative figures. To dig a bit further, at the point where things start failing, where are the cached nodes pointing? IIRC there is still a pathological condition where empty space between limit_pfn and cached32_node gets 'lost' if nothing in between is freed, so the bigger the range of allocation sizes, the worse the effect, e.g.:

(considering an empty domain, pfn 0 *not* reserved, 32-bit limit_pfn)

alloc 4K, succeeds, cached32_node now at 4G-4K
alloc 2G, succeeds, cached32_node now at 0
alloc 4K, fails despite almost 2G contiguous free space within limit_pfn
(and max32_alloc_size==1 now fast-forwards *any* further allocation attempt to failure)

If you're falling foul of this case (I was never sure how realistic a problem it would be in practice), there are at least a couple of much less invasive tweaks I can think of that would be worth exploring.

To answer your question about whether if this driver run under other DMA API backends:
yes, such as 32 bit ARM.

OK, that's what I suspected :)

AFAICS arch/arm's __alloc_iova() is also a first-fit algorithm, so if you get better behaviour there then it would suggest that this aspect isn't really the most important issue. Certainly, the fact that the "best fit" logic here also happens to ignore the cached nodes does start drawing me back to the point above.