RE: [PATCH] swiotlb: eliminate per-map atomic contention on used/hiwater tracking

From: Du, Fan

Date: Tue Jun 30 2026 - 22:54:45 EST

> -----Original Message-----
> From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> Sent: Sunday, June 28, 2026 9:30 AM
> To: Du, Fan <fan.du@xxxxxxxxx>; Michael Kelley <mhklinux@xxxxxxxxxxx>;
> Miao, Jun <jun.miao@xxxxxxxxx>; m.szyprowski@xxxxxxxxxxx;
> robin.murphy@xxxxxxx
> Cc: iommu@xxxxxxxxxxxxxxx; chenhgs@xxxxxxxxxxxxxxx;
> LKML <linux-kernel@xxxxxxxxxxxxxxx>
> Subject: RE: [PATCH] swiotlb: eliminate per-map atomic contention on
> used/hiwater tracking
>
> From: Du, Fan <fan.du@xxxxxxxxx> Sent: Saturday, June 27, 2026 4:21 PM
> >
> > > -----Original Message-----
> > > From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> > > Sent: Saturday, June 27, 2026 12:00 AM
> > >
> > > From: Du, Fan <fan.du@xxxxxxxxx> Sent: Thursday, June 25, 2026 8:12 PM
> > > >
> > > > > -----Original Message-----
> > > > > From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> > > > > Sent: Thursday, June 25, 2026 11:54 PM
> > > > >
> > > > > From: Du, Fan <fan.du@xxxxxxxxx> Sent: Thursday, June 25, 2026 12:30 AM
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> > > > > > > Sent: Tuesday, June 23, 2026 10:35 AM
> > > > > > >
> > > > >
> > > > > [snip]
> > > > >
> > > > > > > > + pool = &mem->defpool;
> > > > > > > > + for (i = 0; i < pool->nareas; i++)
> > > > > > > > + hiwater += READ_ONCE(pool->areas[i].used_hiwater);
> > > > > > >
> > > > > > > Let's ignore the SWIOTLB_DYNAMIC case for simplicity. The
> > > > > > > approach of calculating a separate hiwater mark for each area,
> > > > > > > and then summing those per-area hiwater marks, can produce very
> > > > > > > wrong results.
> > > > > > >
> > > > > > > Consider a 64 MiB swiotlb in a system with 8 CPUs. There will be
> > > > > > > 8 areas, each with 8 MiB of space. Suppose the workload putters
> > > > > > > along with mostly smallish I/Os, say between 4 KiB and 32 KiB. If
> > > > > > > each area has 16 I/Os in progress, the area hiwater mark might be
> > > > > > > 256 KiB (16 I/Os averaging 16 KiB each). Summing across areas
> > > > > > > produces a hiwater mark of 8 * 256 KiB = 2 MiB. But then suppose
> > > > > > > a 2 MiB I/O comes in. The hiwater mark for the area that handles
> > > > > > > that I/O will grow to 2+ MiB. After the first big I/O finishes,
> > > > > > > another 2 MiB I/O comes in that is handled by a different area,
> > > > > > > whose hiwater mark also goes to 2+ MiB. Pretty soon all 8 areas
> > > > > > > have a hiwater mark of 2+ MiB, and the total hiwater mark is
> > > > > > > reported as 16+ MiB. The old algorithm would have reported
> > > > > > > 4+ MiB, which is accurate. With higher CPU counts and more areas,
> > > > > > > the discrepancy can get much worse. This is a somewhat contrived
> > > > > > > example, but the problem is real enough to make the reported
> > > > > > > hiwater mark be unreliable.
> > > > > >
> > > > > > You are correct here, I like your way of thinking.
> > > > > >
> > > > > > > I'm sure the contention for the total hiwater mark in the current
> > > > > > > code is real, but it's in the context of a lot of other CPU work
> > > > > > > that is being done because of the bounce buffering, including
> > > > > > > copying lots of data to/from the bounce buffers. Is that atomic
> > > > > > > increment operation a bottleneck even in the end-to-end context
> > > > > > > of doing DMA through swiotlb bounce buffers?
> > > > > >
> > > > > > Practical benchmarks showcase the highest I/O performance for each
> > > > > > TVM spec, and even a few (4) iperf workers cause contention here.
> > > > > > My gut feeling is yes, real workloads probably hit the bottleneck
> > > > > > here.
> > > > >
> > > > > Just curious -- what is the NIC in the TDX VM? I'm most familiar with
> > > > > the Hyper-V case, where the NIC is the Hyper-V synthetic NIC. That
> > > > > driver uses dedicated send and receive buffers that are allocated and
> > > > > decrypted when the NIC is configured. Most NIC traffic goes through
> > > > > those buffers instead of the swiotlb, so I probably haven't seen cases
> > > > > where the swiotlb is the bottleneck for NIC traffic. I do see the
> > > > > swiotlb as the bottleneck for disk I/O traffic, but the data copying
> > > > > tends to be the gate rather than the allocation and freeing of swiotlb
> > > > > buffers.
> > > > >
> > > > > >
> > > > > > > Another approach to the contention problem would be to have a
> > > > > > > separate CONFIG option that is narrower than CONFIG_DEBUG_FS, so
> > > > > > > that the computation of the hiwater mark can be dropped entirely
> > > > > > > in production environments. Or the setting could be dynamic at
> > > > > > > runtime via a static_call, defaulting to not computing the
> > > > > > > hiwater mark while still allowing a sysadmin to turn it on to see
> > > > > > > workload usage of the swiotlb.
> > > > > >
> > > > > > That's counter-intuitive from my perspective.
> > > > > > With global counters, the observation itself impacts performance,
> > > > > > so it cannot characterize the workload as it runs in practice: the
> > > > > > measured performance is commonly lower than the maximum, which in
> > > > > > turn defeats the purpose of the measurement.
> > > > >
> > > > > Agreed. If the global counters affect the performance and throughput
> > > > > significantly, having an accurate hiwater mark loses some of its value.
> > > > >
> > > > > >
> > > > > > Even without those global counters, if a user wants to know the
> > > > > > hiwater value, periodically snapshotting the used value (the sum
> > > > > > over all areas, as in the current behavior) would produce a
> > > > > > meaningful value for workload evaluation.
> > > > >
> > > > > I'm a little skeptical of the value of just summing current usage.
> > > > > Doing so tends to miss any spikes, and the spikes are the problem. If
> > > > > swiotlb capacity
> > > >
> > > > That's the current design when CONFIG_DEBUG_FS is off, and it is used
> > > > as a swiotlb shortage indicator for the user.
> > > >
> > > > Statistically, that sampled value approximates the true value, as
> > > > always.
> > >
> > > OK, yes, statistical sampling could work. The kernel queues a work item
> > > that runs periodically to sum current usage across all areas. That sum is
> > > compared against the previous sum to calculate a hiwater mark. An
> > > experiment to compare the statistical hiwater mark against the current
> > > exact calculation would be interesting. I wonder how many samples would
> > > be needed, and hence how frequently it would need to run, to get a good
> > > result.
> >
> > The kernel doesn't do that sampling; it only reports the sum when user
> > space requests it. Take a look at the flow when DEBUG_FS is off:
> > essentially it works like vmstat, where the user sets the query interval.
>
> Agree -- the kernel doesn't do that now. I wasn't clear in my comment that
> I was envisioning that the kernel *could* be enhanced to do the statistical
> sampling and report the result via sysfs.

OK, thanks for the inputs!

Based on the previous discussion, and to keep backward compatibility, how about
keeping hiwater as it is but guarding it with a dynamic knob (default off), and
exporting used memory as the sum of the per-area counters, as the current design
does? Users would then still be able to track overall usage without the visible
overhead introduced by the global atomic counter.
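
To make the idea concrete, here is a minimal userspace sketch, not a patch
against swiotlb.c. All names (struct pool, pool_account(), pool_set_tracking()
and so on) are hypothetical, the area count is fixed at 8, and the per-area
counter is a relaxed atomic here, whereas in the kernel it would be updated
under the area lock. The point is only that the map path touches the global
atomics when the knob is on, while "used" is always available as a per-area
sum.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NAREAS 8 /* hypothetical area count, e.g. one per CPU */

struct pool {
	atomic_ulong area_used[NAREAS]; /* per-area slots in use */
	atomic_bool track_hiwater;      /* hypothetical knob, default off */
	atomic_ulong total_used;        /* global; only touched when knob is on */
	atomic_ulong used_hiwater;      /* global; only touched when knob is on */
};

/* Cheap read path: sum the per-area counters on demand. */
static unsigned long pool_used(struct pool *p)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < NAREAS; i++)
		sum += atomic_load_explicit(&p->area_used[i],
					    memory_order_relaxed);
	return sum;
}

static unsigned long pool_hiwater(struct pool *p)
{
	return atomic_load(&p->used_hiwater);
}

/*
 * Turn tracking on or off. Enabling seeds the global counters from the
 * per-area sum; this is racy against concurrent maps, which a real
 * implementation would have to handle.
 */
static void pool_set_tracking(struct pool *p, bool on)
{
	if (on) {
		unsigned long used = pool_used(p);

		atomic_store(&p->total_used, used);
		atomic_store(&p->used_hiwater, used);
	}
	atomic_store(&p->track_hiwater, on);
}

/* Map path: per-area update always, global atomics only if enabled. */
static void pool_account(struct pool *p, size_t area, unsigned long nslots)
{
	unsigned long new_used, old_hiwater;

	assert(area < NAREAS);
	atomic_fetch_add_explicit(&p->area_used[area], nslots,
				  memory_order_relaxed);
	if (!atomic_load_explicit(&p->track_hiwater, memory_order_relaxed))
		return;

	new_used = atomic_fetch_add(&p->total_used, nslots) + nslots;
	old_hiwater = atomic_load(&p->used_hiwater);
	while (new_used > old_hiwater &&
	       !atomic_compare_exchange_weak(&p->used_hiwater, &old_hiwater,
					     new_used))
		; /* retry; old_hiwater was refreshed by the failed exchange */
}

/* Unmap path: mirror of pool_account(). */
static void pool_unaccount(struct pool *p, size_t area, unsigned long nslots)
{
	assert(area < NAREAS);
	atomic_fetch_sub_explicit(&p->area_used[area], nslots,
				  memory_order_relaxed);
	if (atomic_load_explicit(&p->track_hiwater, memory_order_relaxed))
		atomic_fetch_sub(&p->total_used, nslots);
}
```

With the knob off, pool_account() and pool_unaccount() never write a shared
cache line, and pool_used() still gives user space the vmstat-like view. In
the kernel the knob check could plausibly be a static key or static_call as
suggested earlier in the thread, but that part is not modeled here.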

> >
> > > >
> > > > > is exceeded even for a short spike, you don't just get a performance
> > > > > blip. You get I/O failures, which at least on the disk side tends to
> > > > > be fatal to the application doing the I/O. Maybe the networking stack
> > > > > recovers well enough and retries, resulting in just a performance
> > > > > reduction. But I've always thought of swiotlb exhaustion as a fairly
> > > > > serious problem to be avoided at all costs. That's why CoCo VMs
> > > > > allocate so much swiotlb space, even though most of it is never used
> > > > > for typical workloads (at least in my experience).
> > > >
> > > > That's what dynamic SWIOTLB is designed for.
> > > > Only when the I/O is very intensive is the transient buffer/DMA pool
> > > > exhausted before a new shared memory pool is created.
> > >
> > > I don't think dynamic SWIOTLB in its current implementation is very useful
> > > for large CoCo VMs. Exhausting the atomic DMA pool is one problem. The
> > > dynamic swiotlb also can grow in a max of 4 MiB increments, so if 400 MiB
> > > is added dynamically, there's a list of 100 entries that must be searched
> > > to find space (and that list is currently searched linearly). Furthermore,
> > > the pre-allocated swiotlb gets 1 area per CPU, which mostly avoids
> > > contention on the area spin locks. But a dynamically added 4 MiB pool can
> > > have a max of 16 areas because each area must be at least 256 KiB. So
> > > there's more area spin lock contention with higher CPU counts. And that
> > > all assumes that the memory allocator can provide a 4 MiB contiguous area
> > > for the new pool. If the swiotlb needs to grow after there's been memory
> > > fragmentation, an added pool might be limited to 2 MiB or 1 MiB or
> > > smaller, with a corresponding reduction in the area count and increase in
> > > area spin lock contention. Overall, dynamic swiotlb in a big CoCo VM
> > > results in complex and messy behavior with new failure modes and
> > > bottlenecks.
> >
> > Fragmentation does indeed undermine the requested allocation size, as
> > dynamic swiotlb grows by 64 MiB by default.
>
> Actually, I don't think the dynamic swiotlb ever grows by 64 MiB, at least
> not for systems with a 4 KiB page size. Yes, swiotlb_dyn_alloc() requests
> a new pool with size "default_nslabs", which might be 64 MiB. But the
> memory ultimately comes from the buddy allocator, which can provide
> a maximum of 4 MiB (unless MAX_ORDER has been overridden).
> swiotlb_alloc_pool() tries 64 MiB, which fails, so it then tries 32 MiB,
> etc., until it gets down to a size that the buddy allocator can provide.
> The most it will get is 4 MiB, and perhaps less if memory is fragmented.
>
> Michael
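
For reference, the fallback described above can be modeled in a few lines of
userspace C. This is a rough model under stated assumptions, not the kernel
code: model_pool_alloc() and struct pool_fit are hypothetical names, the
requested size is assumed to be a power of two, the buddy allocator limit is
assumed to be 4 MiB (4 KiB pages, default MAX_ORDER), and fragmentation is
reduced to a single "largest free contiguous block" parameter.

```c
#include <assert.h>
#include <stddef.h>

#define SLAB_SIZE  (2UL << 10)  /* 2 KiB per swiotlb slab */
#define SEG_SLABS  128UL        /* slabs per segment: 256 KiB minimum area */
#define MAX_BUDDY  (4UL << 20)  /* assumed largest buddy allocation */

struct pool_fit {
	size_t bytes;        /* size actually obtained, 0 on failure */
	unsigned int nareas; /* areas the pool can be split into */
};

/*
 * Model of the size fallback: start at the requested size and halve it
 * until it fits both the buddy limit and the largest free block.
 */
static struct pool_fit model_pool_alloc(size_t want, size_t largest_free,
					unsigned int ncpus)
{
	const size_t min_area = SEG_SLABS * SLAB_SIZE; /* 256 KiB */
	size_t limit = largest_free < MAX_BUDDY ? largest_free : MAX_BUDDY;
	struct pool_fit fit = { 0, 0 };
	size_t size = want;
	size_t max_areas;

	while (size > limit) {
		if (size <= min_area)
			return fit; /* cannot shrink further: allocation fails */
		size >>= 1;
	}

	/* Each area needs at least 256 KiB, which caps the area count. */
	max_areas = size / min_area;
	if (max_areas == 0)
		max_areas = 1;

	fit.bytes = size;
	fit.nareas = ncpus < max_areas ? ncpus : (unsigned int)max_areas;
	return fit;
}
```

Under these assumptions a 64 MiB request on a 64-CPU guest ends up as a 4 MiB
pool with 16 areas, and with only 1 MiB of contiguous memory available it
drops to a 1 MiB pool with 4 areas, which matches the numbers discussed above.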
>
> > In production practice, for a large TVM, we see that the admin reserves the
> > majority of the needed size and lets dynamic swiotlb add additional
> > buffers. The current dynamic swiotlb doesn't yet support a shrinker to
> > release unused memory back to the buddy allocator.