Re: [PATCH RFC v2 1/1] swiotlb: split buffer into 32-bit default and 64-bit extra zones

From: Christoph Hellwig
Date: Tue Sep 20 2022 - 03:55:38 EST


On Sat, Aug 20, 2022 at 12:42:50AM -0700, Dongli Zhang wrote:
> Hello,
>
> I previously sent out RFC v1, which introduced an extra io_tlb_mem (created
> with SWIOTLB_ANY) in addition to the default io_tlb_mem (32-bit).
> dev->dma_io_tlb_mem was set to either the default or the extra io_tlb_mem,
> depending on the dma mask. However, that approach is not a good fit for
> setting dev->dma_io_tlb_mem transparently at the swiotlb layer, as
> suggested by Christoph Hellwig.
>
> https://lore.kernel.org/all/20220609005553.30954-1-dongli.zhang@xxxxxxxxxx/
>
> Therefore, this RFC v2 follows a different direction. The core ideas are:
>
> 1. The swiotlb is split into two zones, io_tlb_mem->zone[0] (32-bit) and
> io_tlb_mem->zone[1] (64-bit).
>
> struct io_tlb_mem {
> struct io_tlb_zone zone[SWIOTLB_NR];
> struct dentry *debugfs;
> bool late_alloc;
> bool force_bounce;
> bool for_alloc;
> bool has_extra;
> };
>
> struct io_tlb_zone {
> phys_addr_t start;
> phys_addr_t end;
> void *vaddr;
> unsigned long nslabs;
> unsigned long used;
> unsigned int nareas;
> unsigned int area_nslabs;
> struct io_tlb_area *areas;
> struct io_tlb_slot *slots;
> };
>
> 2. By default, only io_tlb_mem->zone[0] is available. io_tlb_mem->zone[1]
> is only allocated if:
>
> - the "swiotlb=" parameter is configured to allocate the extra buffer, and
> - the SWIOTLB_EXTRA flag is set (this is to make sure arches other than
> x86/sev/xen will not enable it until it has been fully tested on each
> arch, e.g., mips/powerpc). Currently it is enabled for x86 and xen.
>
> 3. During swiotlb map, whether zone[0] (32-bit) or zone[1] (64-bit,
> SWIOTLB_ANY) is used depends on
> min_not_zero(*dev->dma_mask, dev->bus_dma_limit).
>
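(Restating point 3 as code, purely for reference: the helper name below is
made up, and SWIOTLB_DF, SWIOTLB_EX and ->has_extra are the names from the
structures quoted above, so this is only a sketch of the idea, not code
from the series.)

/* Hypothetical helper: pick a zone index from the device's DMA limit. */
static unsigned int swiotlb_dev_zone(struct device *dev)
{
	u64 dma_limit = min_not_zero(*dev->dma_mask, dev->bus_dma_limit);

	if (!io_tlb_default_mem.has_extra || dma_limit <= DMA_BIT_MASK(32))
		return SWIOTLB_DF;	/* 32-bit default zone */

	return SWIOTLB_EX;		/* unrestricted 64-bit zone */
}
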
> To test the RFC v2, here is the QEMU command line.
>
> qemu-system-x86_64 -smp 8 -m 32G -enable-kvm -vnc :5 -hda disk.img \
> -kernel path-to-linux/arch/x86_64/boot/bzImage \
> -append "root=/dev/sda1 init=/sbin/init text console=ttyS0 loglevel=7 swiotlb=32768,4194304,force" \
> -net nic -net user,hostfwd=tcp::5025-:22 \
> -device nvme,drive=nvme01,serial=helloworld -drive file=test.qcow2,if=none,id=nvme01 \
> -serial stdio
>
> The following shows up in syslog. The extra 8GB buffer is allocated.
>
> [ 0.152251] software IO TLB: area num 8.
> ... ...
> [ 3.706088] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> [ 3.707334] software IO TLB: mapped default [mem 0x00000000bbfd7000-0x00000000bffd7000] (64MB)
> [ 3.708585] software IO TLB: mapped extra [mem 0x000000061cc00000-0x000000081cc00000] (8192MB)
>
> After FIO is run over NVMe, the 64-bit buffer is used.
>
> $ cat /sys/kernel/debug/swiotlb/io_tlb_nslabs_extra
> 4194304
> $ cat /sys/kernel/debug/swiotlb/io_tlb_used_extra
> 327552
>
> Would you mind advising whether this is the right direction to go?
>
> Thank you very much!
>
> Cc: Konrad Wilk <konrad.wilk@xxxxxxxxxx>
> Cc: Joe Jin <joe.jin@xxxxxxxxxx>
> Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
> ---
> arch/arm/xen/mm.c | 2 +-
> arch/mips/pci/pci-octeon.c | 5 +-
> arch/x86/include/asm/xen/swiotlb-xen.h | 2 +-
> arch/x86/kernel/pci-dma.c | 6 +-
> drivers/xen/swiotlb-xen.c | 18 +-
> include/linux/swiotlb.h | 73 +++--

> diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
> index 3d826c0..4edfa42 100644
> --- a/arch/arm/xen/mm.c
> +++ b/arch/arm/xen/mm.c
> @@ -125,7 +125,7 @@ static int __init xen_mm_init(void)
> return 0;
>
> /* we can work with the default swiotlb */
> - if (!io_tlb_default_mem.nslabs) {
> + if (!io_tlb_default_mem.zone[SWIOTLB_DF].nslabs) {
> rc = swiotlb_init_late(swiotlb_size_or_default(),
> xen_swiotlb_gfp(), NULL);
> if (rc < 0)

The first thing we need, before doing anything about multiple default
pools, is to hide all knowledge of these details inside swiotlb.c.

For swiotlb_init_late that seems easy, as we can just move the check
into it.
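
Something like the sketch below (against the current single-pool layout,
so purely illustrative rather than the actual change):

int swiotlb_init_late(size_t size, gfp_t gfp_mask,
		      int (*remap)(void *tlb, unsigned long nslabs))
{
	/* Nothing to do if the default pool already exists. */
	if (io_tlb_default_mem.nslabs)
		return 0;

	/* ... rest of the existing late init path is unchanged ... */
}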

> diff --git a/arch/mips/pci/pci-octeon.c b/arch/mips/pci/pci-octeon.c
> index e457a18..0bf0859 100644
> --- a/arch/mips/pci/pci-octeon.c
> +++ b/arch/mips/pci/pci-octeon.c
> @@ -654,6 +654,9 @@ static int __init octeon_pci_setup(void)
> octeon_pci_mem_resource.end =
> octeon_pci_mem_resource.start + (1ul << 30);
> } else {
> + struct io_tlb_mem *mem = &io_tlb_default_mem;
> + struct io_tlb_zone *zone = &mem->zone[SWIOTLB_DF];
> +
> /* Remap the Octeon BAR 0 to map 128MB-(128MB+4KB) */
> octeon_npi_write32(CVMX_NPI_PCI_CFG04, 128ul << 20);
> octeon_npi_write32(CVMX_NPI_PCI_CFG05, 0);
> @@ -664,7 +667,7 @@ static int __init octeon_pci_setup(void)
>
> /* BAR1 movable regions contiguous to cover the swiotlb */
> octeon_bar1_pci_phys =
> - io_tlb_default_mem.start & ~((1ull << 22) - 1);
> + zone->start & ~((1ull << 22) - 1);

But we'll need to do something about this mess. I'll need help from
the octeon maintainer on it.

> - x86_swiotlb_flags |= SWIOTLB_ANY;
> + x86_swiotlb_flags |= SWIOTLB_ANY | SWIOTLB_EXTRA;

I don't think this is how it is supposed to work. SWIOTLB_ANY already
says "give me a pool with no addressing constraints"; we don't need two
pools without addressing constraints. EXTRA is also not a very helpful
name here.

> #ifdef CONFIG_X86
> -int xen_swiotlb_fixup(void *buf, unsigned long nslabs)
> +int xen_swiotlb_fixup(void *buf, unsigned long nslabs, unsigned int flags)
> {
> int rc;
> unsigned int order = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT);
> unsigned int i, dma_bits = order + PAGE_SHIFT;
> dma_addr_t dma_handle;
> phys_addr_t p = virt_to_phys(buf);
> + unsigned int max_dma_bits = 32;

I think life will be a lot simpler if the addressing bits are passed to
this function, rather than some kind of flags.
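
I.e. something along these lines (sketch only, mirroring the existing
retry loop; the caller would pass e.g. 32 for the default pool and 64
for an unrestricted one):

int xen_swiotlb_fixup(void *buf, unsigned long nslabs,
		      unsigned int max_dma_bits)
{
	unsigned int order = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT);
	unsigned int i, dma_bits = order + PAGE_SHIFT;
	dma_addr_t dma_handle;
	phys_addr_t p = virt_to_phys(buf);
	int rc;

	for (i = 0; i < nslabs; i += IO_TLB_SEGSIZE) {
		/* Retry with a wider mask up to the caller's limit. */
		do {
			rc = xen_create_contiguous_region(
					p + (i << IO_TLB_SHIFT), order,
					dma_bits, &dma_handle);
		} while (rc && dma_bits++ < max_dma_bits);
		if (rc)
			return rc;
	}
	return 0;
}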

> +#define SWIOTLB_DF 0
> +#define SWIOTLB_EX 1
> +#define SWIOTLB_NR 2

These names are not very descriptive.