RE: [PATCH] swiotlb: eliminate per-map atomic contention on used/hiwater tracking

From: Du, Fan

Date: Thu Jun 25 2026 - 03:30:31 EST

> -----Original Message-----
> From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> Sent: Tuesday, June 23, 2026 10:35 AM
> To: Miao, Jun <jun.miao@xxxxxxxxx>; m.szyprowski@xxxxxxxxxxx;
> robin.murphy@xxxxxxx
> Cc: iommu@xxxxxxxxxxxxxxx; chenhgs@xxxxxxxxxxxxxxx; Du, Fan
> <fan.du@xxxxxxxxx>; LKML <linux-kernel@xxxxxxxxxxxxxxx>
> Subject: RE: [PATCH] swiotlb: eliminate per-map atomic contention on
> used/hiwater tracking
>
>
> From: Jun Miao <jun.miao@xxxxxxxxx> Sent: Monday, June 22, 2026 5:21 AM
> >
>
> [Adding LKML to the Cc: list]
>
> > Under heavy concurrent DMA traffic, inc_used_and_hiwater() performs an
> > atomic_long_add_return() plus a CAS loop on the global used_hiwater, and
> > dec_used() performs an atomic_long_sub() on total_used.
>
> This statement seems overly broad. Normally the swiotlb is rarely used, so
> heavy concurrent DMA traffic doesn't generate this problem. The bad case
> would be in a CoCo VM where all DMA traffic goes through the swiotlb. If
> your statement is based on the CoCo VM case, please call that out.
>
> > All CPUs contend
> > on the same cacheline, causing measurable throughput degradation at
> scale.
> > Both counters are debug-only (CONFIG_DEBUG_FS).
> >
> > Move hiwater tracking into struct io_tlb_area, where area->used is
> > already maintained under area->lock:
> >
> > - Add unsigned long used_hiwater to struct io_tlb_area. Update it
> > inside the existing area->lock critical section in
> > swiotlb_search_pool_area() immediately after area->used is
> > incremented, at zero additional synchronisation cost.
> >
> > - Remove atomic_long_t total_used and used_hiwater from struct
> > io_tlb_mem; make inc_used_and_hiwater()/dec_used() empty stubs.
> >
> > - Unify the two CONFIG_DEBUG_FS / !CONFIG_DEBUG_FS versions of
> > mem_used() into a single implementation that sums per-area used
> > counts with READ_ONCE().
> >
> > - Rewrite io_tlb_hiwater_get() to sum area->used_hiwater across all
> > areas (locklessly with READ_ONCE()); rewrite io_tlb_hiwater_set() to
> > reset each area under its own spinlock.
> >
> > Result: zero global atomic operations on the DMA map/unmap hot path.
>
> Another result (as I describe below in the code) is that the reported
> hiwater value can now be egregiously wrong. Back when I added
> reporting the hiwater value in commit 8b0977ecc8b3, there was
> considerable discussion of the potential impact of the atomic operations,
> and the decision was to place reporting under CONFIG_DEBUG_FS as a
> mitigation option if needed.
>
> At the time, the use case for the hiwater mark was to enable a sysadmin
> to see how much peak swiotlb space is needed for a particular workload.
> This info would allow the swiotlb size to be set smaller than the rather large
> default that is used in CoCo VMs, and free up memory for real workload

> usage. I thought about the approach you've taken here, but my view at
> the time was that approach is inaccurate enough so as to be fairly
> worthless for the intended use case.
>
> >
> > Without this patch, the inc_used_and_hiwater() accounts for the most time
> > from perf assembly record as following log:
> >
> > | spin_unlock_irqrestore():
> > | - call _raw_spin_unlock_irqrestore
> > | swiotlb_area_find_slots():
> > 10.38 | mov $0x60(%rsp),%rax
> > 0.00 | mov $0x2b8(%rax),%rcx
> > | arch_atomic64_add_return():
> > 0.00 | mov %r15,%rax
> > 0.00 | lock xadd %rax,0x58(%rcx)
> > 66.74 | lea (%r15,%rax,1),%rdx
> > | arch_atomic64_read():
> > 0.14 | mov $0x60(%rcx),%rax
> > | inc_used_and_hiwater():
> > 19.74 | cmp %rdx,%rax
> > | jae 43f
> > | arch_atomic64_try_cmpxchg():
> > | lock cmpxchg %rdx,0x60(%rcx)
> > | jne 432
> > | swiotlb_area_find_slots():
> > 0.07 | mov %ebx,%eax
> > 0.04 | jmp 289
> > | wrap_area_index():
> > | xor %edx,%edx
> > | jmp 3fb
> >
> > Suggested-by: Fan Du <fan.du@xxxxxxxxx>
> > Co-developed-by: Chenhgs <chenhgs@xxxxxxxxxxxxxxx>
> > Tested-by: Chenhgs <chenhgs@xxxxxxxxxxxxxxx>
> > Signed-off-by: Jun Miao <jun.miao@xxxxxxxxx>
> > ---
> > include/linux/swiotlb.h | 7 ---
> > kernel/dma/swiotlb.c | 109 +++++++++++++++++-----------------------
> > 2 files changed, 45 insertions(+), 71 deletions(-)
> >
> > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> > index 3dae0f592063..a92496786cb5 100644
> > --- a/include/linux/swiotlb.h
> > +++ b/include/linux/swiotlb.h
> > @@ -97,11 +97,6 @@ struct io_tlb_pool {
> > * @lock: Lock to synchronize changes to the list.
> > * @pools: List of IO TLB memory pool descriptors (if dynamic).
> > * @dyn_alloc: Dynamic IO TLB pool allocation work.
> > - * @total_used: The total number of slots in the pool that are
> currently used
> > - * across all areas. Used only for calculating used_hiwater in
> > - * debugfs.
> > - * @used_hiwater: The high water mark for total_used. Used only for
> reporting
> > - * in debugfs.
> > * @transient_nslabs: The total number of slots in all transient pools that
> > * are currently used across all areas.
> > */
> > @@ -119,8 +114,6 @@ struct io_tlb_mem {
> > struct work_struct dyn_alloc;
> > #endif
> > #ifdef CONFIG_DEBUG_FS
> > - atomic_long_t total_used;
> > - atomic_long_t used_hiwater;
> > atomic_long_t transient_nslabs;
> > #endif
> > };
> > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > index 1abd3e6146f4..9b4130e3f8e8 100644
> > --- a/kernel/dma/swiotlb.c
> > +++ b/kernel/dma/swiotlb.c
> > @@ -107,12 +107,15 @@ static unsigned long default_nareas;
> > * This is a single area with a single lock.
> > *
> > * @used: The number of used IO TLB block.
> > + * @used_hiwater: Per-area high-water mark of used slots. Updated under
> > + * @lock; read locklessly for approximate debugfs reporting.
> > * @index: The slot index to start searching in this area for next round.
> > * @lock: The lock to protect the above data structures in the map and
> > * unmap calls.
> > */
> > struct io_tlb_area {
> > unsigned long used;
> > + unsigned long used_hiwater;
> > unsigned int index;
> > spinlock_t lock;
> > };
> > @@ -959,40 +962,6 @@ static unsigned int wrap_area_index(struct
> io_tlb_pool *mem, unsigned int index)
> > return index;
> > }
> >
> > -/*
> > - * Track the total used slots with a global atomic value in order to have
> > - * correct information to determine the high water mark. The mem_used()
> > - * function gives imprecise results because there's no locking across
> > - * multiple areas.
> > - */
> > -#ifdef CONFIG_DEBUG_FS
> > -static void inc_used_and_hiwater(struct io_tlb_mem *mem, unsigned int
> nslots)
> > -{
> > - unsigned long old_hiwater, new_used;
> > -
> > - new_used = atomic_long_add_return(nslots, &mem->total_used);
> > - old_hiwater = atomic_long_read(&mem->used_hiwater);
> > - do {
> > - if (new_used <= old_hiwater)
> > - break;
> > - } while (!atomic_long_try_cmpxchg(&mem->used_hiwater,
> > - &old_hiwater, new_used));
> > -}
> > -
> > -static void dec_used(struct io_tlb_mem *mem, unsigned int nslots)
> > -{
> > - atomic_long_sub(nslots, &mem->total_used);
> > -}
> > -
> > -#else /* !CONFIG_DEBUG_FS */
> > -static void inc_used_and_hiwater(struct io_tlb_mem *mem, unsigned int
> nslots)
> > -{
> > -}
> > -static void dec_used(struct io_tlb_mem *mem, unsigned int nslots)
> > -{
> > -}
> > -#endif /* CONFIG_DEBUG_FS */
> > -
> > #ifdef CONFIG_SWIOTLB_DYNAMIC
> > #ifdef CONFIG_DEBUG_FS
> > static void inc_transient_used(struct io_tlb_mem *mem, unsigned int
> nslots)
> > @@ -1132,9 +1101,10 @@ static int swiotlb_search_pool_area(struct device
> *dev, struct io_tlb_pool *pool
> > */
> > area->index = wrap_area_index(pool, index + nslots);
> > area->used += nslots;
> > + if (area->used > area->used_hiwater)
> > + area->used_hiwater = area->used;
> > spin_unlock_irqrestore(&area->lock, flags);
> >
> > - inc_used_and_hiwater(dev->dma_io_tlb_mem, nslots);
> > return slot_index;
> > }
> >
> > @@ -1295,29 +1265,12 @@ static int swiotlb_find_slots(struct device *dev,
> phys_addr_t orig_addr,
> >
> > #endif /* CONFIG_SWIOTLB_DYNAMIC */
> >
> > -#ifdef CONFIG_DEBUG_FS
> > -
> > -/**
> > - * mem_used() - get number of used slots in an allocator
> > - * @mem: Software IO TLB allocator.
> > - *
> > - * The result is accurate in this version of the function, because an atomic
> > - * counter is available if CONFIG_DEBUG_FS is set.
> > - *
> > - * Return: Number of used slots.
> > - */
> > -static unsigned long mem_used(struct io_tlb_mem *mem)
> > -{
> > - return atomic_long_read(&mem->total_used);
> > -}
> > -
> > -#else /* !CONFIG_DEBUG_FS */
> > -
> > /**
> > * mem_pool_used() - get number of used slots in a memory pool
> > * @pool: Software IO TLB memory pool.
> > *
> > - * The result is not accurate, see mem_used().
> > + * Returns the sum of per-area used slot counts. The result is approximate
> > + * since there is no locking across areas.
> > *
> > * Return: Approximate number of used slots.
> > */
> > @@ -1327,7 +1280,7 @@ static unsigned long mem_pool_used(struct
> io_tlb_pool *pool)
> > unsigned long used = 0;
> >
> > for (i = 0; i < pool->nareas; i++)
> > - used += pool->areas[i].used;
> > + used += READ_ONCE(pool->areas[i].used);
> > return used;
> > }
> >
> > @@ -1335,8 +1288,8 @@ static unsigned long mem_pool_used(struct
> io_tlb_pool *pool)
> > * mem_used() - get number of used slots in an allocator
> > * @mem: Software IO TLB allocator.
> > *
> > - * The result is not accurate, because there is no locking of individual
> > - * areas.
> > + * Returns the sum of per-area used slot counts across all pools. The result
> > + * is approximate since there is no locking across areas.
> > *
> > * Return: Approximate number of used slots.
> > */
> > @@ -1357,8 +1310,6 @@ static unsigned long mem_used(struct
> io_tlb_mem *mem)
> > #endif
> > }
> >
> > -#endif /* CONFIG_DEBUG_FS */
> > -
> > /**
> > * swiotlb_tbl_map_single() - bounce buffer map a single contiguous
> physical area
> > * @dev: Device which maps the buffer.
> > @@ -1508,8 +1459,6 @@ static void swiotlb_release_slots(struct device
> *dev, phys_addr_t tlb_addr,
> > mem->slots[i].list = ++count;
> > area->used -= nslots;
> > spin_unlock_irqrestore(&area->lock, flags);
> > -
> > - dec_used(dev->dma_io_tlb_mem, nslots);
> > }
> >
> > #ifdef CONFIG_SWIOTLB_DYNAMIC
> > @@ -1531,7 +1480,6 @@ static bool swiotlb_del_transient(struct device
> *dev, phys_addr_t tlb_addr,
> > if (!pool->transient)
> > return false;
> >
> > - dec_used(dev->dma_io_tlb_mem, pool->nslabs);
> > swiotlb_del_pool(dev, pool);
> > dec_transient_used(dev->dma_io_tlb_mem, pool->nslabs);
> > return true;
> > @@ -1710,20 +1658,53 @@ static int io_tlb_used_get(void *data, u64 *val)
> > static int io_tlb_hiwater_get(void *data, u64 *val)
> > {
> > struct io_tlb_mem *mem = data;
> > + unsigned long hiwater = 0;
> > + struct io_tlb_pool *pool;
> > + int i;
> >
> > - *val = atomic_long_read(&mem->used_hiwater);
> > +#ifdef CONFIG_SWIOTLB_DYNAMIC
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(pool, &mem->pools, node)
> > + for (i = 0; i < pool->nareas; i++)
> > + hiwater += READ_ONCE(pool->areas[i].used_hiwater);
> > + rcu_read_unlock();
> > +#else
> > + pool = &mem->defpool;
> > + for (i = 0; i < pool->nareas; i++)
> > + hiwater += READ_ONCE(pool->areas[i].used_hiwater);
>
> Let's ignore the SWIOTLB_DYNAMIC case for simplicity. The approach
> of calculating a separate hiwater mark for each area, and then summing
> those per-area hiwater marks, can produce very wrong results.
>
> Consider a 64MiB swiotlb in a system with 8 CPUs. There will be 8 areas,
> each with 8 MiB of space. Suppose the workload putters along with mostly
> smallish I/Os, say between 4 KiB and 32 KiB. If each area has 16 I/Os in
> progress, the area hiwater mark might be 256 KiB (16 I/Os averaging 16 KiB
> each). Summing across areas produces a hiwater mark of 8 * 256 KiB = 2 MiB.
> But then suppose a 2 MiB I/O comes in. The hiwater mark for the area that
> handles that I/O will grow to 2+ MiB. After the first big I/O finishes,
> another 2 MiB I/O comes in that is handled by a different area, whose
> hiwater mark also goes to 2+ MiB. Pretty soon all 8 areas have a hiwater
> mark of 2+ MiB, and the total hiwater mark is reported as 16+ MiB. The
> old algorithm would have reported 4+ MiB, which is accurate. With
> higher CPU counts and more areas, the discrepancy can get much worse.
> This is a somewhat contrived example, but the problem is real enough
> to make the reported hiwater mark be unreliable.

You are correct here, I like your way of thinking.

> I'm sure the contention for the total hiwater mark in the current code is
> real, but it's in the context of a lot of other CPU work that is being because
> of the bounce buffering, including copying lots of data to/from the bounce
> buffers. Is that atomic increment operation a bottleneck even in the
> end-to-end context of doing DMA through swiotlb bounce buffers?

Practical benchmark show case the highest IO performance for each TVM spec.
even if a few iperf (4)workers would cause the contention here.
My guts feelings, yes, the real workload probably hit the bottleneck here.

> Another approach to the contention problem would be to have a separate
> CONFIG option that is narrower than CONFIG_DEBUG_FS, so that the
> computation of the hiwater mark can be dropped entirely in production
> environments. Or the setting could be dynamic at runtime via a
> static_call, defaulting to not computing the hiwater mark while still
> allowing a sysadmin to turn it on to see workload usage of the swiotlb.

That's counter-intuitive from my perspective.
With global counters, the observation, which itself impacts the performance,
wouldn't be able to tell the practical characterization, that's commonly lower than
max performance, in turn breaks the semantics of what's it for.

Even without those global counters, if user wants to know the hiwater value,
snapshotting used value(sum of each area as current behavior) periodically would
produce meaningful value for workload evaluation.

> Michael
>
> > +#endif
> > + *val = hiwater;
> > return 0;
> > }
> >
> > static int io_tlb_hiwater_set(void *data, u64 val)
> > {
> > struct io_tlb_mem *mem = data;
> > + struct io_tlb_pool *pool;
> > + unsigned long flags;
> > + int i;
> >
> > /* Only allow setting to zero */
> > if (val != 0)
> > return -EINVAL;
> >
> > - atomic_long_set(&mem->used_hiwater, val);
> > +#ifdef CONFIG_SWIOTLB_DYNAMIC
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(pool, &mem->pools, node)
> > + for (i = 0; i < pool->nareas; i++) {
> > + spin_lock_irqsave(&pool->areas[i].lock, flags);
> > + pool->areas[i].used_hiwater = 0;
> > + spin_unlock_irqrestore(&pool->areas[i].lock, flags);
> > + }
> > + rcu_read_unlock();
> > +#else
> > + pool = &mem->defpool;
> > + for (i = 0; i < pool->nareas; i++) {
> > + spin_lock_irqsave(&pool->areas[i].lock, flags);
> > + pool->areas[i].used_hiwater = 0;
> > + spin_unlock_irqrestore(&pool->areas[i].lock, flags);
> > + }
> > +#endif
> > return 0;
> > }
> >
> > --
> > 2.32.0
> >