RE: [PATCH] swiotlb: eliminate per-map atomic contention on used/hiwater tracking

From: Michael Kelley

Date: Sat Jun 27 2026 - 21:30:28 EST

From: Du, Fan <fan.du@xxxxxxxxx> Sent: Saturday, June 27, 2026 4:21 PM
>
> > -----Original Message-----
> > From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> > Sent: Saturday, June 27, 2026 12:00 AM
> >
> > From: Du, Fan <fan.du@xxxxxxxxx> Sent: Thursday, June 25, 2026 8:12 PM
> > >
> > > > -----Original Message-----
> > > > From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> > > > Sent: Thursday, June 25, 2026 11:54 PM
> > > >
> > > > From: Du, Fan <fan.du@xxxxxxxxx> Sent: Thursday, June 25, 2026 12:30 AM
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> > > > > > Sent: Tuesday, June 23, 2026 10:35 AM
> > > > > >
> > > >
> > > > [snip]
> > > >
> > > > > > > + pool = &mem->defpool;
> > > > > > > + for (i = 0; i < pool->nareas; i++)
> > > > > > > + hiwater += READ_ONCE(pool->areas[i].used_hiwater);
> > > > > >
> > > > > > Let's ignore the SWIOTLB_DYNAMIC case for simplicity. The approach
> > > > > > of calculating a separate hiwater mark for each area, and then summing
> > > > > > those per-area hiwater marks, can produce very wrong results.
> > > > > >
> > > > > > Consider a 64MiB swiotlb in a system with 8 CPUs. There will be 8 areas,
> > > > > > each with 8 MiB of space. Suppose the workload putters along with mostly
> > > > > > smallish I/Os, say between 4 KiB and 32 KiB. If each area has 16 I/Os in
> > > > > > progress, the area hiwater mark might be 256 KiB (16 I/Os averaging 16 KiB
> > > > > > each). Summing across areas produces a hiwater mark of 8 * 256 KiB = 2 MiB.
> > > > > > But then suppose a 2 MiB I/O comes in. The hiwater mark for the area that
> > > > > > handles that I/O will grow to 2+ MiB. After the first big I/O finishes,
> > > > > > another 2 MiB I/O comes in that is handled by a different area, whose
> > > > > > hiwater mark also goes to 2+ MiB. Pretty soon all 8 areas have a hiwater
> > > > > > mark of 2+ MiB, and the total hiwater mark is reported as 16+ MiB. The
> > > > > > old algorithm would have reported 4+ MiB, which is accurate. With
> > > > > > higher CPU counts and more areas, the discrepancy can get much worse.
> > > > > > This is a somewhat contrived example, but the problem is real enough
> > > > > > > to make the reported hiwater mark unreliable.
> > > > >
> > > > > You are correct here, I like your way of thinking.
> > > > >
> > > > > > I'm sure the contention for the total hiwater mark in the current code is
> > > > > > > real, but it's in the context of a lot of other CPU work that is being done because
> > > > > > of the bounce buffering, including copying lots of data to/from the bounce
> > > > > > buffers. Is that atomic increment operation a bottleneck even in the
> > > > > > end-to-end context of doing DMA through swiotlb bounce buffers?
> > > > >
> > > > > > Practical benchmarks showcase the highest I/O performance for each TVM spec,
> > > > > > and even a few (4) iperf workers cause contention here.
> > > > > > My gut feeling is yes, real workloads probably hit the bottleneck here.
> > > >
> > > > Just curious -- what is the NIC in the TDX VM? I'm most familiar with the
> > > > Hyper-V case, where the NIC is the Hyper-V synthetic NIC. That driver
> > > > uses dedicated send and receive buffers that are allocated and decrypted
> > > > when the NIC is configured. Most NIC traffic goes through those buffers
> > > > instead of the swiotlb, so I probably haven't seen cases where the swiotlb
> > > > is the bottleneck for NIC traffic. I do see the swiotlb as the bottleneck for
> > > > disk I/O traffic, but the data copying tends to be the gate rather than the
> > > > allocation and freeing of swiotlb buffers.
> > > >
> > > > >
> > > > > > Another approach to the contention problem would be to have a separate
> > > > > > CONFIG option that is narrower than CONFIG_DEBUG_FS, so that the
> > > > > > computation of the hiwater mark can be dropped entirely in production
> > > > > > environments. Or the setting could be dynamic at runtime via a
> > > > > > static_call, defaulting to not computing the hiwater mark while still
> > > > > > allowing a sysadmin to turn it on to see workload usage of the swiotlb.
> > > > >
> > > > > > That's counter-intuitive from my perspective.
> > > > > > With global counters, the observation itself impacts performance, so it
> > > > > > can't capture the workload's real characteristics. What it measures is commonly
> > > > > > lower than max performance, which in turn defeats the purpose of measuring.
> > > >
> > > > Agreed. If the global counters affect the performance and throughput
> > > > significantly, having an accurate hiwater mark loses some of its value.
> > > >
> > > > >
> > > > > > Even without those global counters, if the user wants to know the hiwater value,
> > > > > > periodically snapshotting the used value (the sum over all areas, as in the current
> > > > > > behavior) would produce a meaningful value for workload evaluation.
> > > >
> > > > I'm a little skeptical of the value of just summing current usage. Doing so
> > > > tends to miss any spikes, and the spikes are the problem. If swiotlb capacity
> > >
> > > > That's the current design when CONFIG_DEBUG_FS is off, and it is used as a swiotlb
> > > > shortage indicator for the user.
> > > >
> > > > Statistically, that sampled value approximates the true value, as always.
> >
> > OK, yes, statistical sampling could work. The kernel queues a work item
> > that runs periodically to sum current usage across all areas. That sum is
> > compared against the previous sum to calculate a hiwater mark. An
> > experiment to compare the statistical hiwater mark against the current
> > exact calculation would be interesting. I wonder how many samples would
> > be needed, and hence how frequently it would need to run, to get a good
> > result.
>
> The kernel doesn't do that sampling; it only reports the sum when user space requests it.
> Take a look at the flow when DEBUG_FS is off: it essentially works like vmstat, and
> the user sets the query interval.

Agree -- the kernel doesn't do that now. I wasn't clear in my comment that
I was envisioning that the kernel *could* be enhanced to do the statistical
sampling and report the result via sysfs.
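
To make the idea concrete, here is a minimal user-space sketch of one
sampling pass. It is only a model, not kernel code: struct sampled_hiwater
and sample_usage() are hypothetical names, and in the kernel the pass would
presumably run from a periodic work item rather than being called directly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state kept between sampling passes. */
struct sampled_hiwater {
	unsigned long hiwater;	/* largest sum seen by any sample so far */
};

/*
 * One sampling pass: sum the current per-area usage and remember the
 * largest sum observed. Returns the sum from this pass.
 */
static unsigned long sample_usage(struct sampled_hiwater *s,
				  const unsigned long *area_used,
				  size_t nareas)
{
	unsigned long sum = 0;
	size_t i;

	assert(s && area_used);
	for (i = 0; i < nareas; i++)
		sum += area_used[i];
	if (sum > s->hiwater)
		s->hiwater = sum;
	return sum;
}
```

Any usage spike that starts and ends between two passes never shows up in
the recorded value, which is why the sampling frequency matters so much.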

>
> > >
> > > > is exceeded even for a short spike, you don't just get a performance blip.
> > > > > You get I/O failures, which at least on the disk side tend to be fatal to the
> > > > application doing the I/O. Maybe the networking stack recovers well enough
> > > > and retries, resulting in just a performance reduction. But I've always thought
> > > > of swiotlb exhaustion as a fairly serious problem to be avoided at all costs.
> > > > That's why CoCo VMs allocate so much swiotlb space, even though most of
> > > > it is never used for typical workloads (at least in my experience).
> > >
> > > > That's what dynamic SWIOTLB is designed for.
> > > > Only when the I/O is very intensive is the transient buffer/DMA pool exhausted
> > > > before a new shared memory pool is created.
> >
> > I don't think dynamic SWIOTLB in its current implementation is very useful
> > for large CoCo VMs. Exhausting the atomic DMA pool is one problem. The
> > dynamic swiotlb also can grow in a max of 4 MiB increments, so if 400 MiB is
> > added dynamically, there's a list of 100 entries that must be searched
> > to find space (and that list is currently searched linearly). Furthermore,
> > the pre-allocated swiotlb gets 1 area per CPU, which mostly avoids contention
> > on the area spin locks. But a dynamically added 4 MiB pool can have a max of
> > 16 areas because each area must be at least 256 KiB. So there's more area
> > > spin lock contention with higher CPU counts. And that all assumes that the
> > memory allocator can provide a 4 MiB contiguous area for the new pool. If
> > the swiotlb needs to grow after there's been memory fragmentation, an
> > added pool might be limited to 2 MiB or 1 MiB or smaller, with a
> > corresponding reduction in the area count and increase in area spin lock
> > contention. Overall, dynamic swiotlb in a big CoCo VM results in complex
> > and messy behavior with new failure modes and bottlenecks.
>
> Fragmentation indeed undermines the requested allocation size, as dynamic
> swiotlb grows by 64M by default.

Actually, I don't think the dynamic swiotlb ever grows by 64 MiB, at least
not for systems with a 4 KiB page size. Yes, swiotlb_dyn_alloc() requests
a new pool with size "default_nslabs", which might be 64 MiB. But the
memory ultimately comes from the buddy allocator, which can provide
a maximum of 4 MiB (unless MAX_ORDER has been overridden).
swiotlb_alloc_pool() tries 64 MiB, which fails, so it then tries 32 MiB,
etc., until it gets down to a size that the buddy allocator can provide.
The most it will get is 4 MiB, and perhaps less if memory is fragmented.
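
As a rough illustration of the arithmetic, here is a small user-space
model of that halving behavior and of the area and pool counts mentioned
earlier. It is a sketch, not the kernel implementation: the helper names
are hypothetical, and the 4 MiB buddy limit and 256 KiB minimum area size
are the values assumed in this thread for a 4 KiB page size.

```c
#include <assert.h>
#include <stddef.h>

#define KiB ((size_t)1 << 10)
#define MiB ((size_t)1 << 20)

/* Assumed limits, taken from the discussion above. */
#define BUDDY_MAX_BYTES	(4 * MiB)	/* largest contiguous buddy allocation */
#define MIN_AREA_BYTES	(256 * KiB)	/* minimum size of one swiotlb area */

/*
 * Hypothetical helper: start at the requested size and halve it after each
 * failed attempt. "largest_free" models the biggest contiguous block the
 * allocator can currently supply, which fragmentation may push below
 * BUDDY_MAX_BYTES.
 */
static size_t pool_size_after_fallback(size_t requested, size_t largest_free)
{
	size_t limit = largest_free < BUDDY_MAX_BYTES ?
		       largest_free : BUDDY_MAX_BYTES;
	size_t size = requested;

	assert(requested > 0);
	while (size > limit)
		size /= 2;
	return size;
}

/* Upper bound on the number of areas a pool of this size can hold. */
static size_t max_areas(size_t pool_bytes)
{
	return pool_bytes / MIN_AREA_BYTES;
}

/* Number of pools of "pool_bytes" each needed to add "total_bytes". */
static size_t pools_needed(size_t total_bytes, size_t pool_bytes)
{
	assert(pool_bytes > 0);
	return (total_bytes + pool_bytes - 1) / pool_bytes;
}
```

With these assumptions, a 64 MiB request ends up as a 4 MiB pool with at
most 16 areas, and growing by 400 MiB takes 100 such pools. If
fragmentation limits the allocator to 1 MiB, the pool holds at most 4 areas.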

Michael

> In production practice, for large TVMs, we see that admins reserve the
> majority of the needed size and let dynamic swiotlb add additional buffers.
> The current dynamic swiotlb doesn't yet support a shrinker to release unused memory back
> to the buddy allocator.