RE: [PATCH v3 1/1] Documentation/core-api: Add swiotlb documentation

From: Michael Kelley
Date: Tue Apr 30 2024 - 11:48:54 EST


From: Petr Tesařík <petr@xxxxxxxxxxx> Sent: Tuesday, April 30, 2024 4:24 AM
> >
> > +Usage Scenarios
> > +---------------
> > +swiotlb was originally created to handle DMA for devices with addressing
> > +limitations. As physical memory sizes grew beyond 4 GiB, some devices could
> > +only provide 32-bit DMA addresses. By allocating bounce buffer memory below
> > +the 4 GiB line, these devices with addressing limitations could still work and
> > +do DMA.
>
> IIRC the origins are even older and bounce buffers were used to
> overcome the design flaws inherited all the way from the original IBM
> PC. These computers used an Intel 8237 for DMA. This chip has a 16-bit
> address register, but even the early 8088 CPUs had a 20-bit bus. So IBM
> added a separate 74LS670 4-by-4 register file chip to provide the high 4
> bits for each of the 4 DMA channels. As a side note, these bits were not
> updated when the 8237 address register was incrementing from 0xffff, so
> DMA would overflow at every 64K address boundary. PC AT then replaced
> these 4 bits with an 8-bit DMA "page" register to match the 24-bit
> address bus of an 80286. This design was not changed for 32-bit CPUs
> (i.e. 80386).
>
> In short, bounce buffers were not introduced because of 64-bit CPUs.
> They were already needed on 386 systems.
>
> OTOH this part of the history need not be mentioned in the
> documentation (unless you WANT it).

I knew there was some even earlier history, but I didn't know the
details. :-( I'll add some qualifying wording about there being multiple
DMA addressing limitations during the history of the x86 PCs, with
the 32-bit addressing as a more recent example. But I won't try to
cover the details of what you describe.

>
> > +
> > +More recently, Confidential Computing (CoCo) VMs have the guest VM's memory
> > +encrypted by default, and the memory is not accessible by the host hypervisor
> > +and VMM. For the host to do I/O on behalf of the guest, the I/O must be
> > +directed to guest memory that is unencrypted. CoCo VMs set a kernel-wide option
> > +to force all DMA I/O to use bounce buffers, and the bounce buffer memory is set
> > +up as unencrypted. The host does DMA I/O to/from the bounce buffer memory, and
> > +the Linux kernel DMA layer does "sync" operations to cause the CPU to copy the
> > +data to/from the original target memory buffer. The CPU copying bridges between
> > +the unencrypted and the encrypted memory. This use of bounce buffers allows
> > +existing device drivers to "just work" in a CoCo VM, with no modifications
> > +needed to handle the memory encryption complexity.
>
> This part might be misleading. It sounds as if SWIOTLB would not be
> needed if drivers were smarter.

I'm not sure I understand the point you are making. It is possible for a
driver to do its own manual bounce buffering to handle encrypted memory.
For example, in adding support for CoCo VMs, we encountered such a
driver/device with complex DMA and memory requirements that already
did some manual bounce buffering. When used in a CoCo VM, driver
modifications were needed to handle encrypted memory, but that was
the preferred approach because of the pre-existing manual bounce
buffering. In that case, indeed swiotlb was not needed by that driver/device.
But in the general case, we don't want to require driver modifications for
CoCo VMs. swiotlb bounce buffering makes it all work in exactly the
situation you describe where the buffer memory could have originated
in a variety of places.

Could you clarify your point? Or perhaps suggest alternate wording
that would help avoid any confusion?

> But IIUC that's not the case. SWIOTLB
> is used for streaming DMA, where device drivers have little control
> over the physical placement of a DMA buffer. For example, when a
> process allocates some memory, the kernel cannot know that this memory
> will be later passed to a write(2) syscall to do direct I/O of a
> properly aligned buffer that can go all the way down to the NVMe driver
> with zero copy.
>
> > +
> > +Other edge case scenarios arise for bounce buffers. For example, when IOMMU
> > +mappings are set up for a DMA operation to/from a device that is considered
> > +"untrusted", the device should be given access only to the memory containing
> > +the data being transferred. But if that memory occupies only part of an IOMMU
> > +granule, other parts of the granule may contain unrelated kernel data. Since
> > +IOMMU access control is per-granule, the untrusted device can gain access to
> > +the unrelated kernel data. This problem is solved by bounce buffering the DMA
> > +operation and ensuring that unused portions of the bounce buffers do not
> > +contain any unrelated kernel data.
> > +
> > +Core Functionality
> > +------------------
> > +The primary swiotlb APIs are swiotlb_tbl_map_single() and
> > +swiotlb_tbl_unmap_single(). The "map" API allocates a bounce buffer of a
> > +specified size in bytes and returns the physical address of the buffer. The
> > +buffer memory is physically contiguous. The expectation is that the DMA layer
> > +maps the physical memory address to a DMA address, and returns the DMA address
> > +to the driver for programming into the device. If a DMA operation specifies
> > +multiple memory buffer segments, a separate bounce buffer must be allocated for
> > +each segment. swiotlb_tbl_map_single() always does a "sync" operation (i.e., a
> > +CPU copy) to initialize the bounce buffer to match the contents of the original
> > +buffer.
> > +
> > +swiotlb_tbl_unmap_single() does the reverse. If the DMA operation updated the
> > +bounce buffer memory, the DMA layer does a "sync" operation to cause a CPU copy
> > +of the data from the bounce buffer back to the original buffer. Then the bounce
> > +buffer memory is freed.
>
> You may want to mention DMA_ATTR_SKIP_CPU_SYNC here.

Fair enough. I'll add a sentence.

>
> > +
> > +swiotlb also provides "sync" APIs that correspond to the dma_sync_*() APIs that
> > +a driver may use when control of a buffer transitions between the CPU and the
> > +device. The swiotlb "sync" APIs cause a CPU copy of the data between the
> > +original buffer and the bounce buffer. Like the dma_sync_*() APIs, the swiotlb
> > +"sync" APIs support doing a partial sync, where only a subset of the bounce
> > +buffer is copied to/from the original buffer.
> > +
> > +Core Functionality Constraints
> > +------------------------------
> > +The swiotlb map/unmap/sync APIs must operate without blocking, as they are
> > +called by the corresponding DMA APIs which may run in contexts that cannot
> > +block. Hence the default memory pool for swiotlb allocations must be
> > +pre-allocated at boot time (but see Dynamic swiotlb below). Because swiotlb
> > +allocations must be physically contiguous, the entire default memory pool is
> > +allocated as a single contiguous block.
>
> Allocations must be contiguous in target device's DMA address space. In
> practice this is achieved by being contiguous in CPU physical address
> space (aka "physically contiguous"), but there might be subtle
> differences, e.g. in a virtualized environment.
>
> Now that I'm thinking about it, leave the paragraph as is, and I'll
> update it if I write the code for it.

OK

>
> > +
> > +The need to pre-allocate the default swiotlb pool creates a boot-time tradeoff.
> > +The pool should be large enough to ensure that bounce buffer requests can
> > +always be satisfied, as the non-blocking requirement means requests can't wait
> > +for space to become available. But a large pool potentially wastes memory, as
> > +this pre-allocated memory is not available for other uses in the system. The
> > +tradeoff is particularly acute in CoCo VMs that use bounce buffers for all DMA
> > +I/O. These VMs use a heuristic to set the default pool size to ~6% of memory,
> > +with a max of 1 GiB, which has the potential to be very wasteful of memory.
> > +Conversely, the heuristic might produce a size that is insufficient, depending
> > +on the I/O patterns of the workload in the VM. The dynamic swiotlb feature
> > +described below can help, but has limitations. Better management of the swiotlb
> > +default memory pool size remains an open issue.
> > +
> > +A single allocation from swiotlb is limited to IO_TLB_SIZE * IO_TLB_SEGSIZE
> > +bytes, which is 256 KiB with current definitions. When a device's DMA settings
> > +are such that the device might use swiotlb, the maximum size of a DMA segment
> > +must be limited to that 256 KiB. This value is communicated to higher-level
> > +kernel code via dma_max_mapping_size() and swiotlb_max_mapping_size(). If the
> > +higher-level code fails to account for this limit, it may make requests that
> > +are too large for swiotlb, and get a "swiotlb full" error.
> > +
> > +A key device DMA setting is "min_align_mask". When set, swiotlb allocations are
> > +done so that the min_align_mask bits of the physical address of the bounce
>
> Let's be specific: the least significant min_align_mask bits.

Yes, being a little more specific is good. I'll change it as follows:

A key device DMA setting is "min_align_mask", which is a power of 2 minus 1,
so that some number of low order bits are set. swiotlb allocations ensure
these low order bits of the physical address of the bounce buffer match the
same bits in the address of the original buffer. If min_align_mask is non-zero, it
may produce an "alignment offset" in the address ....

>
> The rest of the document is perfect.
>
> Petr T