Re: [RFC v1 3/4] swiotlb: Allow dynamic allocation of bounce buffers

From: Petr Tesarik
Date: Tue Mar 28 2023 - 03:55:33 EST

Next message: Mathias Nyman: "Re: [PATCH] xhci: Free the command allocated for setting LPM if we return early"
Previous message: Jan Kara: "Re: [PATCH 3/3] fanotify: use pidfd_prepare()"
In reply to: Christoph Hellwig: "Re: [RFC v1 3/4] swiotlb: Allow dynamic allocation of bounce buffers"
Next in thread: Petr Tesarik: "Re: [RFC v1 3/4] swiotlb: Allow dynamic allocation of bounce buffers"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 3/28/2023 6:07 AM, Christoph Hellwig wrote:
> [adding Alex as he has been interested in this in the past]
>
> On Mon, Mar 20, 2023 at 01:28:15PM +0100, Petr Tesarik wrote:
>> Second, on the Raspberry Pi 4, swiotlb is used by dma-buf for pages
>> moved from the rendering GPU (v3d driver), which can access all
>> memory, to the display output (vc4 driver), which is connected to a
>> bus with an address limit of 1 GiB and no IOMMU. These buffers can
>> be large (several megabytes) and cannot be handled by SWIOTLB,
>> because they exceed maximum segment size of 256 KiB. Such mapping
>> failures can be easily reproduced on a Raspberry Pi4: Starting
>> GNOME remote desktop results in a flood of kernel messages like
>> these:
>
> Shouldn't we make sure dma-buf allocates the buffers for the most
> restricted devices, and more importantly does something like a dma
> coherent allocation instead of a dynamic mapping of random memory?
>
> While a larger swiotlb works around this I don't think this fixes the root
> cause.

I tend to agree here. However, it's the DMABUF design itself that causes
some trouble. The buffer is allocated by the v3d driver, which does not
have the restriction, so the DMA API typically allocates an address
somewhere near the 4G boundary. Userspace then exports the buffer, sends
it to another process as a file descriptor and imports it into the vc4
driver, which requires DMA below 1G. In the beginning, v3d had no idea
that the buffer would be exported to userspace, much less that it would
be later imported into vc4.

Anyway, I suspected that the buffers need not be imported into the vc4
driver (also hinted by Eric Anholt in a 2018 blog post [1]), and it
seems I was right. I encountered the issue with Ubuntu 22.10; I
installed latest openSUSE Tumbleweed yesterday, and I was not able to
reproduce the issue there, most likely because the Mesa drivers have
been fixed meanwhile. This makes the specific case of the Raspberry Pi 4
drivers moot. The issue may still affect other combinations of drivers,
but I don't have any other real-world example ATM.

[1] https://anholt.github.io/twivc4/2018/02/12/twiv/

>> 1. The value is limited to ULONG_MAX, which is too little both for
>> physical addresses (e.g. x86 PAE or 32-bit ARM LPAE) and DMA
>> addresses (e.g. Xen guests on 32-bit ARM).
>>
>> 2. Since buffers are currently allocated with page granularity, a
>> PFN can be used instead. However, some values are reserved by
>> the maple tree implementation. Liam suggests to use
>> xa_mk_value() in that case, but that reduces the usable range by
>> half. Luckily, 31 bits are still enough to hold a PFN on all
>> 32-bit platforms.
>>
>> 3. Software IO TLB is used from interrupt context. The maple tree
>> implementation is not IRQ-safe (MT_FLAGS_LOCK_IRQ does nothing
>> AFAICS). Instead, I use an external lock, spin_lock_irqsave() and
>> spin_unlock_irqrestore().
>>
>> Note that bounce buffers are never allocated dynamically if the
>> software IO TLB is in fact a DMA restricted pool, which is intended
>> to be stay in its designated location in physical memory.
>
> I'm a little worried about all that because it causes quite a bit
> of overhead even for callers that don't end up going into the
> dynamic range or do not use swiotlb at all. I don't really have a
> good answer here except for the usual avoid bounce buffering whenever
> you can that might not always be easy to do.

I'm also worried about all this overhead. OTOH I was not able to confirm
it, because the difference between two successive fio test runs on an
unmodified kernel was bigger than the difference between a vanilla and a
patched kernel, except the maximum completion latency, which OTOH
affected less than 0.01% of all requests.

BTW my testing also suggests that the streaming DMA API is quite
inefficient, because UAS performance _improved_ with swiotlb=force.
Sure, this should probably be addressed in the UAS and/or xHCI driver,
but what I mean is that moving away from swiotlb may even cause
performance regressions, which is counter-intuitive. At least I would
_not_ have expected it.

>> + gfp = (attrs & DMA_ATTR_MAY_SLEEP) ? GFP_KERNEL : GFP_NOWAIT;
>> + slot = kmalloc(sizeof(*slot), gfp | __GFP_NOWARN);
>> + if (!slot)
>> + goto err;
>> +
>> + slot->orig_addr = orig_addr;
>> + slot->alloc_size = alloc_size;
>> + slot->page = dma_direct_alloc_pages(dev, PAGE_ALIGN(alloc_size),
>> + &slot->dma_addr, dir,
>> + gfp | __GFP_NOWARN);
>> + if (!slot->page)
>> + goto err_free_slot;
>
> Without GFP_NOIO allocations this will deadlock eventually.

Ah, that would affect the non-sleeping case (GFP_KERNEL), right?

Petr T

Next message: Mathias Nyman: "Re: [PATCH] xhci: Free the command allocated for setting LPM if we return early"
Previous message: Jan Kara: "Re: [PATCH 3/3] fanotify: use pidfd_prepare()"
In reply to: Christoph Hellwig: "Re: [RFC v1 3/4] swiotlb: Allow dynamic allocation of bounce buffers"
Next in thread: Petr Tesarik: "Re: [RFC v1 3/4] swiotlb: Allow dynamic allocation of bounce buffers"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]