RE: [RFC V1 1/5] swiotlb: Support allocating DMA memory from SWIOTLB

From: Michael Kelley
Date: Sat Feb 24 2024 - 17:03:06 EST


From: Vishal Annapurve <vannapurve@xxxxxxxxxx> Sent: Saturday, February 24, 2024 9:07 AM
>
> On Fri, Feb 16, 2024 at 1:56 AM Michael Kelley <mhklinux@xxxxxxxxxxx> wrote:
> >
> > From: Alexander Graf <graf@xxxxxxxxxx> Sent: Thursday, February 15, 2024 1:44 AM
> > >
> > > On 15.02.24 04:33, Vishal Annapurve wrote:
> > > > On Wed, Feb 14, 2024 at 8:20 PM Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
> > > >> On Fri, Jan 12, 2024 at 05:52:47AM +0000, Vishal Annapurve wrote:
> > > >>> Modify the SWIOTLB framework to always allocate DMA memory from SWIOTLB.
> > > >>>
> > > >>> CVMs use SWIOTLB buffers for bouncing memory when using dma_map_* APIs
> > > >>> to set up memory for IO operations. SWIOTLB buffers are marked as shared
> > > >>> once during early boot.
> > > >>>
> > > >>> Buffers allocated using dma_alloc_* APIs are allocated from kernel memory
> > > >>> and then converted to shared during each API invocation. This patch ensures
> > > >>> that such buffers are also allocated from already shared SWIOTLB
> > > >>> regions. This allows enforcing alignment requirements on regions marked
> > > >>> as shared.
> > > >> But does it work in practice?
> > > >>
> > > >> Some devices (like GPUs) require a lot of DMA memory. So with this approach
> > > >> we would need to have a huge SWIOTLB buffer that is unused in most VMs.
> > > >>
> > > > The current implementation limits the size of the statically allocated
> > > > SWIOTLB memory pool to 1 GB. I was proposing to enable dynamic SWIOTLB
> > > > for CVMs in addition to aligning the memory allocations to hugepage sizes, so
> > > > that the SWIOTLB pool can be scaled up on demand.
> > > >
> >
> > Vishal --
> >
> > When the dynamic swiotlb mechanism tries to grow swiotlb space
> > by adding another pool, it gets the additional memory as a single
> > physically contiguous chunk using alloc_pages(). It starts by trying
> > to allocate a chunk the size of the original swiotlb size, and if that
> > fails, halves the size until it gets a size where the allocation succeeds.
> > The minimum size is 1 Mbyte, and if that fails, the "grow" fails.
> >
>
> Thanks for pointing this out.
>
> > So it seems like dynamic swiotlb is subject to almost the same
> > memory fragmentation limitations as trying to allocate huge pages.
> > swiotlb needs a minimum of 1 Mbyte contiguous in order to grow,
> > while huge pages need 2 Mbytes, but either is likely to be
> > problematic in a VM that has been running a while. With that
> > in mind, I'm not clear on the benefit of enabling dynamic swiotlb.
> > It seems like it just moves around the problem of needing high order
> > contiguous memory allocations. Or am I missing something?
> >
>
> Currently the SWIOTLB pool is limited to 1GB in size. Kirill has
> pointed out that devices like GPUs could need a significant amount of
> memory to be allocated from the SWIOTLB pool. Without dynamic SWIOTLB,
> such devices run the risk of memory exhaustion without any hope of
> recovery.

Actually, in a CoCo VM the swiotlb pool *defaults* to 6% of
memory, with a max of 1 Gbyte. See mem_encrypt_setup_arch().
So in a CoCo VM with 16 Gbytes of memory or more, you
typically see 1 Gbyte as the swiotlb size. But that's only the
default. Using the kernel boot line parameter "swiotlb=<nnn>"
you can set the initial swiotlb size to something larger, with the
max typically being in the 3 Gbyte range. The 3 Gbyte limit arises
because the swiotlb pool must reside below the 4 Gbyte line
in the CoCo VM's guest physical memory, and there are other
things like 32-bit MMIO space that also must fit below the
4 Gbyte line. Related, I've contemplated submitting a patch to
allow the swiotlb pool in a CoCo VM to be above the 4 Gbyte
line, since the original reasons for being below the 4 Gbyte line
probably don't apply in a CoCo VM. With such a change, the
kernel boot line could specify an even larger initial swiotlb pool.
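
To spell out the sizing arithmetic, here's a quick user-space sketch
of the default calculation described above (illustrative only, not the
kernel code; the 64 Mbyte floor is my assumption based on the normal
swiotlb default -- see mem_encrypt_setup_arch() for the real logic):

/*
 * Illustrative sketch of the CoCo VM swiotlb default sizing described
 * above: 6% of guest memory, capped at 1 Gbyte.  The 64 Mbyte floor is
 * an assumption, not taken from the kernel source.
 */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define MB	(1024ULL * 1024)
#define GB	(1024 * MB)

static uint64_t coco_swiotlb_default(uint64_t guest_mem)
{
	uint64_t size = guest_mem * 6 / 100;	/* 6% of guest memory */

	if (size < 64 * MB)			/* assumed floor */
		size = 64 * MB;
	if (size > 1 * GB)			/* cap at 1 Gbyte */
		size = 1 * GB;
	return size;
}

int main(void)
{
	uint64_t mem;

	/* The 6% calculation hits the 1 Gbyte cap at roughly 16.7 Gbytes */
	for (mem = 4 * GB; mem <= 64 * GB; mem *= 2)
		printf("%3" PRIu64 " GB guest -> %4" PRIu64 " MB swiotlb\n",
		       mem / GB, coco_swiotlb_default(mem) / MB);
	return 0;
}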

But as you point out, calculating a maximum ahead of time
is not really possible, and choosing a really large value to be
"safe" is likely to waste a lot of memory in most cases. So
using dynamic swiotlb makes sense, except that you can't
be sure that dynamic growth will really work because
fragmentation may prevent getting enough contiguous
memory.
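
To make that concrete, here's a rough sketch of the grow-and-halve
behavior I described earlier in the thread (again illustrative only;
try_alloc_contiguous() is a stand-in for the alloc_pages() call in the
real code, and the 4 Mbyte fragmentation limit is just an example):

/*
 * Illustrative sketch of dynamic swiotlb growth: try to allocate a
 * contiguous chunk the size of the original swiotlb pool, halve on
 * failure, and give up below 1 Mbyte.
 */
#include <stdbool.h>
#include <stdio.h>

#define MB		(1024UL * 1024)
#define MIN_GROW	(1 * MB)

/* Stand-in for alloc_pages(); pretend fragmentation limits us to 4 MB. */
static bool try_alloc_contiguous(unsigned long bytes)
{
	return bytes <= 4 * MB;
}

static unsigned long grow_swiotlb(unsigned long orig_pool_bytes)
{
	unsigned long bytes = orig_pool_bytes;

	while (bytes >= MIN_GROW) {
		if (try_alloc_contiguous(bytes))
			return bytes;	/* added a pool of this size */
		bytes /= 2;		/* halve and retry */
	}
	return 0;			/* the "grow" fails */
}

int main(void)
{
	/* With a 1 Gbyte original pool, the grow settles at 4 Mbytes here */
	printf("grew by %lu MB\n", grow_swiotlb(1024 * MB) / MB);
	return 0;
}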

One other consideration with dynamic swiotlb pools:
The initially allocated swiotlb pool is divided into "areas",
defaulting to one area for each vCPU in the VM. This allows
CPUs to do swiotlb allocations from their area without having
to contend for a shared spin lock. But if a CPU finds its own
area is full, then it will search the areas of other CPUs, which
can produce spin lock contention, though hopefully that's
rare. In the case of a dynamically allocated addition of 2
Mbytes (for example), the number of areas is limited to
2M/256K = 8. In a VM with more than 8 vCPUs, multiple
CPUs will immediately be contending for the same area
in the dynamically allocated addition, and we've seen
that swiotlb spin lock contention can be a perf issue in
CoCo VMs. Being able to allocate a memory chunk bigger
than 2 Mbytes would allow for more areas, but of course
success in allocating bigger chunks is less likely and the
alloc_pages() limit on x86/x64 is 4 Mbytes anyway.
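
Spelling out that arithmetic (a rough sketch; the 256 Kbyte-per-area
minimum is inferred from the 2M/256K = 8 figure above, and as I recall
the real code also rounds the area count to a power of 2):

/*
 * Illustrative sketch of the swiotlb area arithmetic described above,
 * assuming a 256 Kbyte minimum per area: a 2 Mbyte dynamically added
 * pool gets at most 8 areas, so a VM with more than 8 vCPUs shares
 * areas (and hence spin locks) in that pool from the start.
 */
#include <stdio.h>

#define KB		1024UL
#define MB		(1024 * KB)
#define AREA_MIN_BYTES	(256 * KB)	/* assumed per-area minimum */

static unsigned long nr_areas(unsigned long pool_bytes, unsigned long nr_vcpus)
{
	unsigned long max_areas = pool_bytes / AREA_MIN_BYTES;

	return nr_vcpus < max_areas ? nr_vcpus : max_areas;
}

int main(void)
{
	/* 2 MB added pool, 32 vCPUs: only 8 areas, so ~4 vCPUs per area */
	printf("2 MB pool, 32 vCPUs -> %lu areas\n", nr_areas(2 * MB, 32));
	/* 1 GB initial pool, 32 vCPUs: one area per vCPU, no sharing */
	printf("1 GB pool, 32 vCPUs -> %lu areas\n", nr_areas(1024 * MB, 32));
	return 0;
}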

Overall, my point is that there are tradeoffs. Dynamic
swiotlb may look like a good approach, but it has some
downsides that aren't immediately obvious. My
understanding is that the original motivation for dynamic
swiotlb was small systems with limited memory, where you
could start with a really small swiotlb, and then grow as
necessary. It's less of a good fit on large CoCo VMs with
dozens of vCPUs, for the reasons described above.

>
> In addition, I am proposing to have the dma_alloc_* APIs use the
> SWIOTLB area as well, adding to the memory pressure. If there were a
> way to calculate the maximum amount of memory needed for all DMA
> allocations across all possible devices used by CoCo VMs, then one
> could use that number to preallocate the SWIOTLB pool. I am arguing
> that calculating such an upper bound is difficult and, instead of
> trying to calculate it, allowing SWIOTLB to scale dynamically would be
> better since it avoids over-provisioning memory up front.

Agreed, but see above. I'm not saying dynamic swiotlb
can't be used, just that we should be aware of the tradeoffs.

Michael

>
> So if the above argument for enabling dynamic SWIOTLB makes sense, then
> it should be relatively easy to add hugepage alignment restrictions
> for SWIOTLB pool increments (in line with the fact that 2 MB and 1 MB
> allocations are nearly equally prone to failure due to memory
> fragmentation).
>
> > Michael
> >
> > > > The issue with aligning the pool areas to hugepages is that hugepage
> > > > allocation at runtime is not guaranteed. Guaranteeing hugepage
> > > > allocation might require calculating the upper bound in advance, which
> > > > defeats the purpose of enabling dynamic SWIOTLB. I am open to
> > > > suggestions here.
> > >
> > >
> > > You could allocate a max bound at boot using CMA and then only fill into
> > > the CMA area when SWIOTLB size requirements increase? The CMA region
> > > will allow movable allocations as long as you don't require the CMA space.
> > >
> > >
> > > Alex
> >