Re: [PATCH v5 1/5] ARM: dma-mapping: Optimize allocation

From: Tomasz Figa
Date: Wed Jan 13 2016 - 12:33:48 EST


On Wed, Jan 13, 2016 at 9:17 PM, Robin Murphy <robin.murphy@xxxxxxx> wrote:
> Hi Doug,
>
>
> On 08/01/16 23:05, Douglas Anderson wrote:
>>
>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>> sizeable buffers. Upon simple tests of video I saw it trying to
>> allocate 4,194,304 bytes. The function tries to allocate large chunks
>> in order to optimize IOMMU TLB usage.
>>
>> The current function is very, very slow.
>>
>> One problem is the way it keeps trying and trying to allocate big
>> chunks. Imagine a very fragmented memory that has 4M free but no
>> contiguous pages at all. Further imagine allocating 4M (1024 pages).
>> We'll do the following memory allocations:
>> - For page 1:
>> - Try to allocate order 10 (no retry)
>> - Try to allocate order 9 (no retry)
>> - ...
>> - Try to allocate order 0 (with retry, but not needed)
>> - For page 2:
>> - Try to allocate order 9 (no retry)
>> - Try to allocate order 8 (no retry)
>> - ...
>> - Try to allocate order 0 (with retry, but not needed)
>> - ...
>> - ...
>>
>> Total number of calls to alloc() calls for this case is:
>> sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>> => 9228
>>
>> The above is obviously worse case, but given how slow alloc can be we
>> really want to try to avoid even somewhat bad cases. I timed the old
>> code with a device under memory pressure and it wasn't hard to see it
>> take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing
>> was done on kernel 3.14, so possibly mainline would behave
>> differently).
>>
>> A second problem is that allocating big chunks under memory pressure
>> when we don't need them is just not a great idea anyway unless we really
>> need them. We can make due pretty well with smaller chunks so it's
>> probably wise to leave bigger chunks for other users once memory
>> pressure is on.
>>
>> Let's adjust the allocation like this:
>>
>> 1. If a big chunk fails, stop trying to hard and bump down to lower
>> order allocations.
>> 2. Don't try useless orders. The whole point of big chunks is to
>> optimize the TLB and it can really only make use of 2M, 1M, 64K and
>> 4K sizes.
>>
>> We'll still tend to eat up a bunch of big chunks, but that might be the
>> right answer for some users. A future patch could possibly add a new
>> DMA_ATTR that would let the caller decide that TLB optimization isn't
>> important and that we should use smaller chunks. Presumably this would
>> be a sane strategy for some callers.
>
>
> Now that I've had time to think about it properly:
>
> Reviewed-by: Robin Murphy <robin.murphy@xxxxxxx>
>
> I just had an absolutely disgusting idea of how to get the same progression
> with just a single variable and no static array, but I'll keep that firmly
> to myself as it's almost IOCCC-grade WTF :D

Just out of curiosity, a bitmap and loop with fls() and clearing bit
on failure or something more freaky? :)

Anyway:

Reviewed-by: Tomasz Figa <tfiga@xxxxxxxxxxxx>

Best regards,
Tomasz