Re: [PATCH v2 RESEND 4/7] swiotlb: Dynamically allocated bounce buffers

From: Petr Tesařík
Date: Tue May 16 2023 - 02:17:00 EST


Hi Michael,

On Mon, 15 May 2023 19:43:52 +0000
"Michael Kelley (LINUX)" <mikelley@xxxxxxxxxxxxx> wrote:

> From: Petr Tesarik <petrtesarik@xxxxxxxxxxxxxxx> Sent: Tuesday, May 9, 2023 2:18 AM
> >
> > The software IO TLB was designed with the assumption that it is not
> > used much, especially on 64-bit systems, so a small fixed memory
> > area (currently 64 MiB) is sufficient to handle the few cases which
> > still require a bounce buffer. However, these cases are not so rare
> > in some circumstances.
> >
> > First, if SEV is active, all DMA must be done through shared
> > unencrypted pages, and SWIOTLB is used to make this happen without
> > changing device drivers. The software IO TLB size is increased to 6%
> > of total memory in sev_setup_arch(), but that is more of an
> > approximation. The actual requirements may vary depending on which
> > drivers are used and the amount of I/O.
>
> FWIW, I don't think the approach you have implemented here will be
> practical to use for CoCo VMs (SEV, TDX, whatever else). The problem
> is that dma_direct_alloc_pages() and dma_direct_free_pages() must
> call dma_set_decrypted() and dma_set_encrypted(), respectively. In CoCo
> VMs, these calls are expensive because they require a hypercall to the host,
> and the operation on the host isn't trivial either. I haven't measured the
> overhead, but doing a hypercall on every DMA map operation and on
> every unmap operation has long been something we thought we must
> avoid. The fixed swiotlb bounce buffer space solves this problem by
> doing set_decrypted() in batch at boot time, and never
> doing set_encrypted().

I know. For performance, alloc() and free() on each DMA map/unmap are
not ideal even without CoCo. I have already tested some code locally
that keeps buffers around after unmap, effectively creating a per-device
pool as you describe below. Right now, I don't have SEV-capable hardware
for testing, but on bare metal this pool brought performance back to the
level of fixed swiotlb buffers, and in some scenarios even improved on it.

However, these per-device pools add more complexity, so I decided to
start with a smaller patch series that can be reviewed more easily. If
there are no objections to the general approach, I'll send the second
part with the per-device pools.
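
To make this more concrete, here is a rough sketch of the reuse scheme I
have in mind, built on the structures introduced in this patch. The
dma_io_tlb_dyn_free list and both helpers below are illustrative names
only, not the code I actually tested:

static struct io_tlb_dyn_slot *swiotlb_dyn_get_cached(struct device *dev,
                                                      size_t alloc_size)
{
        struct io_tlb_dyn_slot *slot, *found = NULL;
        unsigned long flags;

        spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
        list_for_each_entry(slot, &dev->dma_io_tlb_dyn_free, node) {
                /* Reuse the first cached buffer that is large enough. */
                if (slot->alloc_size >= alloc_size) {
                        list_move(&slot->node, &dev->dma_io_tlb_dyn_slots);
                        found = slot;
                        break;
                }
        }
        spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
        return found;
}

static void swiotlb_dyn_put_cached(struct device *dev,
                                   struct io_tlb_dyn_slot *slot)
{
        unsigned long flags;

        /*
         * Keep the pages (and their unencrypted state under CoCo) for
         * the next mapping instead of calling dma_direct_free_pages();
         * a shrinker or timer (not shown) would trim this list when it
         * grows too large.
         */
        spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
        list_move(&slot->node, &dev->dma_io_tlb_dyn_free);
        spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
}

swiotlb_dyn_map() would try swiotlb_dyn_get_cached() (updating
slot->orig_addr for the new mapping) before falling back to
dma_direct_alloc_pages(), and swiotlb_dyn_unmap() would call
swiotlb_dyn_put_cached() instead of freeing, so a reused buffer skips
the dma_set_decrypted()/dma_set_encrypted() round trip you mention.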

> In Microsoft's first implementation of bounce buffering for SEV-SNP VMs,
> we created custom bounce buffer code separate from swiotlb. This code
> did something similar to what you've done, but maintained a per-device pool of allocated
> buffers that could be reused, rather than freeing the memory (and marking
> the memory encrypted again) on every DMA unmap operation. (The pool
> was actually per-VMBus channel, but VMBus channels are per-device, so
> the effect was the same.) The reusable pool avoided most of the calls to
> set_decrypted()/set_encrypted() and made it practical from a performance
> standpoint. But of course, the pool could grow arbitrarily large, so there
> was additional complexity to decay and trim the pool size. LKML feedback
> early on was to use swiotlb instead, which made sense, but at the cost of
> needing to figure out the appropriate fixed size of the swiotlb, and likely
> over-provisioning to avoid running out of bounce buffer space.
>
> Now we're considering again a more dynamic approach, which is good, but
> we're encountering the same problems.
>
> See https://lore.kernel.org/linux-hyperv/20210228150315.2552437-1-ltykernel@xxxxxxxxx/
> for this historical example.

Thanks for the pointer. I'll definitely have a look!

Petr T

> Michael
>
> >
> > Second, some embedded devices have very little RAM, so 64 MiB is not
> > negligible. Sadly, these are exactly the devices that also often
> > need a software IO TLB. Although the minimum swiotlb size can be found
> > empirically by extensive testing, it would be easier to allocate a
> > small swiotlb at boot and let it grow on demand.
> >
> > Growing the SWIOTLB data structures at run time is impossible. The
> > whole SWIOTLB region is contiguous in physical memory to allow
> > combining adjacent slots and also to ensure that alignment
> > constraints can be met. The SWIOTLB is too big for the buddy
> > allocator (cf. MAX_ORDER). More importantly, even if a new SWIOTLB
> > could be allocated (e.g. from CMA), it cannot be extended in-place
> > (because surrounding pages may be already allocated for other
> > purposes), and there is no mechanism for relocating already mapped
> > bounce buffers: The DMA API gets only the address of a buffer, and
> > the implementation (direct or IOMMU) checks whether it belongs to
> > the software IO TLB.
> >
> > It is possible to allocate multiple smaller struct io_tlb_mem
> > instances. However, they would have to be stored in a non-constant
> > container (list or tree), which needs synchronization between
> > readers and writers, creating contention in a hot path for all
> > devices, not only those which need software IO TLB.
> >
> > Another option is to allocate a very large SWIOTLB at boot, but
> > allow migrating pages to other users (like CMA does). This approach
> > might work, but there are many open issues:
> >
> > 1. After a page is migrated away from SWIOTLB, it must not be used
> > as a (direct) DMA buffer. Otherwise SWIOTLB code would have to
> > check which pages have been migrated to determine whether a given
> > buffer address belongs to a bounce buffer or not, effectively
> > introducing all the issues of multiple SWIOTLB instances.
> >
> > 2. Unlike SWIOTLB, CMA cannot be used from atomic contexts, for a
> > number of different reasons. This might be changed in theory, but
> > it would take a lot of investigation and time. OTOH, an improvement
> > to the SWIOTLB is needed now.
> >
> > 3. If SWIOTLB is implemented separately from CMA and not as its
> > part, users have to solve the dilemma of how to distribute
> > precious DMA-able memory between them.
> >
> > The present patch is a simplistic solution. Bounce buffers are
> > allocated using the non-coherent DMA API, removing the need to
> > implement a new custom allocator. These buffers are then tracked in
> > a per-device linked list, reducing the impact on devices that do not
> > need the SWIOTLB.
> >
> > Analysis of real-world I/O patterns has shown that the same buffer
> > is typically looked up repeatedly (for further sync operations, or
> > to be unmapped). The most recently used bounce buffer is therefore
> > always moved to the beginning of the list. The list performed better
> > than a maple tree when tested with fio against a QEMU SATA drive
> > backed by a RAM block device in the host (4 cores, 16 iodepth).
> > Other scenarios are also likely to benefit from this MRU order but
> > have not been tested.
> >
> > Operations on the list are serialized with a spinlock. It is
> > unfortunately not possible to use an RCU list, because quiescent
> > state is not guaranteed to happen before an asynchronous event (e.g.
> > hardware interrupt) on another CPU. If that CPU used an old version
> > of the list, it would fail to copy data to and/or from the newly
> > allocated bounce buffer.
> >
> > Last but not least, bounce buffers are never allocated dynamically
> > if the SWIOTLB is in fact a DMA restricted pool, because that would
> > defeat the purpose of a restricted pool.
> >
> > Signed-off-by: Petr Tesarik <petr.tesarik.ext@xxxxxxxxxx>
> > ---
> > include/linux/device.h | 8 ++
> > include/linux/swiotlb.h | 8 +-
> > kernel/dma/swiotlb.c | 252 ++++++++++++++++++++++++++++++++++++++--
> > 3 files changed, 259 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/device.h b/include/linux/device.h
> > index 472dd24d4823..d1d2b8557b30 100644
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -510,6 +510,12 @@ struct device_physical_location {
> > * @dma_mem: Internal for coherent mem override.
> > * @cma_area: Contiguous memory area for dma allocations
> > * @dma_io_tlb_mem: Pointer to the swiotlb pool used. Not for driver use.
> > + * @dma_io_tlb_dyn_lock:
> > + * Spinlock to protect the list of dynamically allocated bounce
> > + * buffers.
> > + * @dma_io_tlb_dyn_slots:
> > + * Dynamically allocated bounce buffers for this device.
> > + * Not for driver use.
> > * @archdata: For arch-specific additions.
> > * @of_node: Associated device tree node.
> > * @fwnode: Associated device node supplied by platform firmware.
> > @@ -615,6 +621,8 @@ struct device {
> > #endif
> > #ifdef CONFIG_SWIOTLB
> > struct io_tlb_mem *dma_io_tlb_mem;
> > + spinlock_t dma_io_tlb_dyn_lock;
> > + struct list_head dma_io_tlb_dyn_slots;
> > #endif
> > /* arch specific additions */
> > struct dev_archdata archdata;
> > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> > index 281ecc6b9bcc..6aada6ac31e2 100644
> > --- a/include/linux/swiotlb.h
> > +++ b/include/linux/swiotlb.h
> > @@ -114,6 +114,8 @@ struct io_tlb_mem {
> > };
> > extern struct io_tlb_mem io_tlb_default_mem;
> >
> > +bool is_swiotlb_dyn(struct device *dev, phys_addr_t paddr);
> > +
> > /**
> > * is_swiotlb_fixed() - check if a physical address belongs to a swiotlb slot
> > * @mem: relevant swiotlb pool
> > @@ -147,7 +149,9 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
> > {
> > struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> >
> > - return mem && is_swiotlb_fixed(mem, paddr);
> > + return mem &&
> > + (is_swiotlb_fixed(mem, paddr) ||
> > + is_swiotlb_dyn(dev, paddr));
> > }
> >
> > static inline bool is_swiotlb_force_bounce(struct device *dev)
> > @@ -164,6 +168,8 @@ static inline bool is_swiotlb_force_bounce(struct device *dev)
> > static inline void swiotlb_dev_init(struct device *dev)
> > {
> > dev->dma_io_tlb_mem = &io_tlb_default_mem;
> > + spin_lock_init(&dev->dma_io_tlb_dyn_lock);
> > + INIT_LIST_HEAD(&dev->dma_io_tlb_dyn_slots);
> > }
> >
> > void swiotlb_init(bool addressing_limited, unsigned int flags);
> > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > index 96ba93be6772..612e1c2e9573 100644
> > --- a/kernel/dma/swiotlb.c
> > +++ b/kernel/dma/swiotlb.c
> > @@ -68,6 +68,22 @@ struct io_tlb_slot {
> > unsigned int list;
> > };
> >
> > +/**
> > + * struct io_tlb_dyn_slot - dynamically allocated swiotlb slot
> > + * @node: node in the per-device list
> > + * @orig_addr: physical address of the original DMA buffer
> > + * @alloc_size: total size of the DMA buffer
> > + * @page: first page of the bounce buffer
> > + * @dma_addr: DMA address of the bounce buffer
> > + */
> > +struct io_tlb_dyn_slot {
> > + struct list_head node;
> > + phys_addr_t orig_addr;
> > + size_t alloc_size;
> > + struct page *page;
> > + dma_addr_t dma_addr;
> > +};
> > +
> > static bool swiotlb_force_bounce;
> > static bool swiotlb_force_disable;
> >
> > @@ -466,6 +482,191 @@ void __init swiotlb_exit(void)
> > memset(mem, 0, sizeof(*mem));
> > }
> >
> > +/**
> > + * lookup_dyn_slot_locked() - look up a dynamically allocated bounce buffer
> > + * @dev: device which has mapped the buffer
> > + * @paddr: physical address within the bounce buffer
> > + *
> > + * Walk through the list of bounce buffers dynamically allocated for @dev,
> > + * looking for a buffer which contains @paddr.
> > + *
> > + * Context: Any context. Expects dma_io_tlb_dyn_lock lock to be held.
> > + * Return:
> > + * Address of a &struct io_tlb_dyn_slot, or %NULL if not found.
> > + */
> > +static struct io_tlb_dyn_slot *lookup_dyn_slot_locked(struct device *dev,
> > + phys_addr_t paddr)
> > +{
> > + struct io_tlb_dyn_slot *slot;
> > +
> > + list_for_each_entry(slot, &dev->dma_io_tlb_dyn_slots, node) {
> > + phys_addr_t start = page_to_phys(slot->page);
> > + phys_addr_t end = start + slot->alloc_size - 1;
> > +
> > + if (start <= paddr && paddr <= end)
> > + return slot;
> > + }
> > + return NULL;
> > +}
> > +
> > +/**
> > + * lookup_dyn_slot() - look up a dynamically allocated bounce buffer
> > + * @dev: device which has mapped the buffer
> > + * @paddr: physical address within the bounce buffer
> > + *
> > + * Search for a dynamically allocated bounce buffer which contains
> > + * @paddr. If found, the buffer is moved to the beginning of the linked
> > + * list. Use lookup_dyn_slot_locked() directly where this behavior is not
> > + * required/desired.
> > + *
> > + * Context: Any context. Takes and releases dma_io_tlb_dyn_lock.
> > + * Return:
> > + * Address of a &struct io_tlb_dyn_slot, or %NULL if not found.
> > + */
> > +static struct io_tlb_dyn_slot *lookup_dyn_slot(struct device *dev,
> > + phys_addr_t paddr)
> > +{
> > + struct io_tlb_dyn_slot *slot;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
> > + slot = lookup_dyn_slot_locked(dev, paddr);
> > + if (slot)
> > + list_move(&slot->node, &dev->dma_io_tlb_dyn_slots);
> > + spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
> > + return slot;
> > +}
> > +
> > +/**
> > + * is_swiotlb_dyn() - check if a physical address belongs to a dynamically
> > + * allocated bounce buffer
> > + * @dev: device which has mapped the buffer
> > + * @paddr: physical address within the bounce buffer
> > + *
> > + * Check whether there is a dynamically allocated bounce buffer which
> > + * contains @paddr. If found, the buffer is moved to the beginning of
> > + * the MRU list.
> > + *
> > + * Return:
> > + * * %true if @paddr points into a dynamically allocated bounce buffer
> > + * * %false otherwise
> > + */
> > +bool is_swiotlb_dyn(struct device *dev, phys_addr_t paddr)
> > +{
> > + return !!lookup_dyn_slot(dev, paddr);
> > +}
> > +
> > +/**
> > + * swiotlb_dyn_bounce() - copy a dynamically allocated buffer from or back
> > + * to the original dma location
> > + * @dev: device which has mapped the buffer
> > + * @tlb_addr: physical address inside the bounce buffer
> > + * @size: size of the region to be copied
> > + * @dir: direction of the data transfer
> > + *
> > + * Copy data to or from a buffer of @size bytes starting at @tlb_addr.
> > + * This function works only for dynamically allocated bounce buffers.
> > + */
> > +static void swiotlb_dyn_bounce(struct device *dev, phys_addr_t tlb_addr,
> > + size_t size, enum dma_data_direction dir)
> > +{
> > + struct io_tlb_dyn_slot *slot = lookup_dyn_slot(dev, tlb_addr);
> > + unsigned int tlb_offset;
> > + unsigned char *vaddr;
> > +
> > + if (!slot)
> > + return;
> > +
> > + tlb_offset = tlb_addr - page_to_phys(slot->page);
> > + vaddr = page_address(slot->page) + tlb_offset;
> > +
> > + swiotlb_copy(dev, slot->orig_addr, vaddr, size, slot->alloc_size,
> > + tlb_offset, dir);
> > +}
> > +
> > +/**
> > + * swiotlb_dyn_map() - allocate a bounce buffer dynamically
> > + * @dev: device which maps the buffer
> > + * @orig_addr: address of the original buffer
> > + * @alloc_size: total size of the original buffer
> > + * @alloc_align_mask:
> > + * required physical alignment of the I/O buffer
> > + * @dir: direction of the data transfer
> > + * @attrs: optional DMA attributes for the map operation
> > + *
> > + * Allocate a new bounce buffer using DMA non-coherent API. This function
> > + * assumes that there is a fallback allocation scheme if the allocation
> > + * fails. In fact, it always fails for buffers smaller than a page and
> > + * for address constraints that are not (yet) correctly handled by
> > + * dma_direct_alloc_pages().
> > + *
> > + * Return: Physical address of the bounce buffer, or %DMA_MAPPING_ERROR.
> > + */
> > +static phys_addr_t swiotlb_dyn_map(struct device *dev, phys_addr_t orig_addr,
> > + size_t alloc_size, unsigned int alloc_align_mask,
> > + enum dma_data_direction dir, unsigned long attrs)
> > +{
> > + struct io_tlb_dyn_slot *slot;
> > + unsigned long flags;
> > + gfp_t gfp;
> > +
> > + /* Allocation has page granularity. Avoid small buffers. */
> > + if (alloc_size < PAGE_SIZE)
> > + goto err;
> > +
> > + /* DMA direct does not deal with physical address constraints. */
> > + if (alloc_align_mask || dma_get_min_align_mask(dev))
> > + goto err;
> > +
> > + gfp = (attrs & DMA_ATTR_MAY_SLEEP) ? GFP_NOIO : GFP_NOWAIT;
> > + slot = kmalloc(sizeof(*slot), gfp | __GFP_NOWARN);
> > + if (!slot)
> > + goto err;
> > +
> > + slot->orig_addr = orig_addr;
> > + slot->alloc_size = alloc_size;
> > + slot->page = dma_direct_alloc_pages(dev, PAGE_ALIGN(alloc_size),
> > + &slot->dma_addr, dir,
> > + gfp | __GFP_NOWARN);
> > + if (!slot->page)
> > + goto err_free_slot;
> > +
> > + spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
> > + list_add(&slot->node, &dev->dma_io_tlb_dyn_slots);
> > + spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
> > +
> > + return page_to_phys(slot->page);
> > +
> > +err_free_slot:
> > + kfree(slot);
> > +err:
> > + return (phys_addr_t)DMA_MAPPING_ERROR;
> > +}
> > +
> > +/**
> > + * swiotlb_dyn_unmap() - unmap a dynamically allocated bounce buffer
> > + * @dev: device which mapped the buffer
> > + * @tlb_addr: physical address of the bounce buffer
> > + * @dir: direction of the data transfer
> > + *
> > + * Release all resources associated with a dynamically allocated bounce
> > + * buffer.
> > + */
> > +static void swiotlb_dyn_unmap(struct device *dev, phys_addr_t tlb_addr,
> > + enum dma_data_direction dir)
> > +{
> > + struct io_tlb_dyn_slot *slot;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
> > + slot = lookup_dyn_slot_locked(dev, tlb_addr);
> > + list_del(&slot->node);
> > + spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
> > +
> > + dma_direct_free_pages(dev, slot->alloc_size, slot->page,
> > + slot->dma_addr, dir);
> > + kfree(slot);
> > +}
> > +
> > /*
> > * Return the offset into a iotlb slot required to keep the device happy.
> > */
> > @@ -474,11 +675,19 @@ static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
> > return addr & dma_get_min_align_mask(dev) & (IO_TLB_SIZE - 1);
> > }
> >
> > -/*
> > - * Bounce: copy the swiotlb buffer from or back to the original dma location
> > +/**
> > + * swiotlb_fixed_bounce() - copy a fixed-slot swiotlb buffer from or back
> > + * to the original dma location
> > + * @dev: device which has mapped the buffer
> > + * @tlb_addr: physical address inside the bounce buffer
> > + * @size: size of the region to be copied
> > + * @dir: direction of the data transfer
> > + *
> > + * Copy data to or from a buffer of @size bytes starting at @tlb_addr.
> > + * This function works only for fixed bounce buffers.
> > */
> > -static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
> > - enum dma_data_direction dir)
> > +static void swiotlb_fixed_bounce(struct device *dev, phys_addr_t tlb_addr,
> > + size_t size, enum dma_data_direction dir)
> > {
> > struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> > int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
> > @@ -574,6 +783,25 @@ static void swiotlb_copy(struct device *dev, phys_addr_t orig_addr,
> > }
> > }
> >
> > +/**
> > + * swiotlb_bounce() - copy the swiotlb buffer from or back to the original
> > + * dma location
> > + * @dev: device which has mapped the buffer
> > + * @tlb_addr: physical address inside the bounce buffer
> > + * @size: size of the region to be copied
> > + * @dir: direction of the data transfer
> > + *
> > + * Copy data to or from a buffer of @size bytes starting at @tlb_addr.
> > + */
> > +static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
> > + enum dma_data_direction dir)
> > +{
> > + if (is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
> > + swiotlb_fixed_bounce(dev, tlb_addr, size, dir);
> > + else
> > + swiotlb_dyn_bounce(dev, tlb_addr, size, dir);
> > +}
> > +
> > static inline phys_addr_t slot_addr(phys_addr_t start, phys_addr_t idx)
> > {
> > return start + (idx << IO_TLB_SHIFT);
> > @@ -834,8 +1062,13 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
> > return (phys_addr_t)DMA_MAPPING_ERROR;
> > }
> >
> > - tlb_addr = swiotlb_fixed_map(dev, orig_addr, alloc_size,
> > - alloc_align_mask, attrs);
> > + tlb_addr = (phys_addr_t)DMA_MAPPING_ERROR;
> > + if (!is_swiotlb_for_alloc(dev))
> > + tlb_addr = swiotlb_dyn_map(dev, orig_addr, alloc_size,
> > + alloc_align_mask, dir, attrs);
> > + if (tlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
> > + tlb_addr = swiotlb_fixed_map(dev, orig_addr, alloc_size,
> > + alloc_align_mask, attrs);
> >
> > if (tlb_addr == (phys_addr_t)DMA_MAPPING_ERROR) {
> > if (!(attrs & DMA_ATTR_NO_WARN))
> > @@ -919,7 +1152,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
> > (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
> > swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
> >
> > - swiotlb_release_slots(dev, tlb_addr);
> > + if (is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
> > + swiotlb_release_slots(dev, tlb_addr);
> > + else
> > + swiotlb_dyn_unmap(dev, tlb_addr, dir);
> > }
> >
> > void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
> > @@ -1089,7 +1325,7 @@ bool swiotlb_free(struct device *dev, struct page *page, size_t size)
> > {
> > phys_addr_t tlb_addr = page_to_phys(page);
> >
> > - if (!is_swiotlb_buffer(dev, tlb_addr))
> > + if (!is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
> > return false;
> >
> > swiotlb_release_slots(dev, tlb_addr);
> > --
> > 2.25.1
>