RE: [PATCH] swiotlb: eliminate per-map atomic contention on used/hiwater tracking

From: Michael Kelley

Date: Thu Jun 25 2026 - 11:55:05 EST

From: Du, Fan <fan.du@xxxxxxxxx> Sent: Thursday, June 25, 2026 12:30 AM
>
> > -----Original Message-----
> > From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> > Sent: Tuesday, June 23, 2026 10:35 AM
> > To: Miao, Jun <jun.miao@xxxxxxxxx>; m.szyprowski@xxxxxxxxxxx;
> > robin.murphy@xxxxxxx
> > Cc: iommu@xxxxxxxxxxxxxxx; chenhgs@xxxxxxxxxxxxxxx; Du, Fan
> > <fan.du@xxxxxxxxx>; LKML <linux-kernel@xxxxxxxxxxxxxxx>
> > Subject: RE: [PATCH] swiotlb: eliminate per-map atomic contention on
> > used/hiwater tracking
> >

[snip]

> > > +	pool = &mem->defpool;
> > > +	for (i = 0; i < pool->nareas; i++)
> > > +		hiwater += READ_ONCE(pool->areas[i].used_hiwater);
> >
> > Let's ignore the SWIOTLB_DYNAMIC case for simplicity. The approach
> > of calculating a separate hiwater mark for each area, and then summing
> > those per-area hiwater marks, can produce very wrong results.
> >
> > Consider a 64MiB swiotlb in a system with 8 CPUs. There will be 8 areas,
> > each with 8 MiB of space. Suppose the workload putters along with mostly
> > smallish I/Os, say between 4 KiB and 32 KiB. If each area has 16 I/Os in
> > progress, the area hiwater mark might be 256 KiB (16 I/Os averaging 16 KiB
> > each). Summing across areas produces a hiwater mark of 8 * 256 KiB = 2 MiB.
> > But then suppose a 2 MiB I/O comes in. The hiwater mark for the area that
> > handles that I/O will grow to 2+ MiB. After the first big I/O finishes,
> > another 2 MiB I/O comes in that is handled by a different area, whose
> > hiwater mark also goes to 2+ MiB. Pretty soon all 8 areas have a hiwater
> > mark of 2+ MiB, and the total hiwater mark is reported as 16+ MiB. The
> > old algorithm would have reported 4+ MiB, which is accurate. With
> > higher CPU counts and more areas, the discrepancy can get much worse.
> > This is a somewhat contrived example, but the problem is real enough
> > to make the reported hiwater mark unreliable.
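
To make the arithmetic in the example above concrete, here is a minimal
userspace sketch. It is not kernel code: struct area_model, struct
pool_model, and every function name below are hypothetical, sizes are in
KiB, and the model is single-threaded. It replays the scenario (256 KiB
of small I/O per area, then one 2 MiB I/O per area, one at a time) and
tracks both a global hiwater mark and per-area hiwater marks.

```c
#include <assert.h>
#include <stddef.h>

#define NAREAS 8	/* assumption: one area per CPU, as in the example */

/* Hypothetical model types, not the kernel's swiotlb structures. */
struct area_model {
	unsigned long used;		/* KiB currently mapped in this area */
	unsigned long used_hiwater;	/* largest value 'used' has reached */
};

struct pool_model {
	struct area_model areas[NAREAS];
	unsigned long total_used;	/* KiB mapped across all areas */
	unsigned long total_hiwater;	/* largest value 'total_used' has reached */
};

/* Account for a mapping of 'kib' KiB in one area, updating both marks. */
static void model_map(struct pool_model *p, size_t area, unsigned long kib)
{
	assert(area < NAREAS);
	struct area_model *a = &p->areas[area];

	a->used += kib;
	if (a->used > a->used_hiwater)
		a->used_hiwater = a->used;

	p->total_used += kib;
	if (p->total_used > p->total_hiwater)
		p->total_hiwater = p->total_used;
}

/* Release a mapping; hiwater marks never decrease. */
static void model_unmap(struct pool_model *p, size_t area, unsigned long kib)
{
	assert(area < NAREAS);
	assert(p->areas[area].used >= kib);
	p->areas[area].used -= kib;
	p->total_used -= kib;
}

/* The quantity the patch would report: the sum of per-area hiwater marks. */
static unsigned long summed_area_hiwater(const struct pool_model *p)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < NAREAS; i++)
		sum += p->areas[i].used_hiwater;
	return sum;
}

/* Baseline small I/O in every area, then one 2 MiB I/O per area in turn. */
static void replay_scenario(struct pool_model *p)
{
	for (size_t i = 0; i < NAREAS; i++)
		model_map(p, i, 256);

	for (size_t i = 0; i < NAREAS; i++) {
		model_map(p, i, 2048);
		model_unmap(p, i, 2048);
	}
}
```

Starting from a zeroed struct pool_model, replay_scenario() leaves
total_hiwater at 4096 KiB (4 MiB), while summed_area_hiwater() returns
18432 KiB (18 MiB), because each area remembers its own 2 MiB spike even
though the spikes never overlapped.
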
>
> You are correct here, I like your way of thinking.
>
> > I'm sure the contention for the total hiwater mark in the current code is
> > real, but it's in the context of a lot of other CPU work that is being done because
> > of the bounce buffering, including copying lots of data to/from the bounce
> > buffers. Is that atomic increment operation a bottleneck even in the
> > end-to-end context of doing DMA through swiotlb bounce buffers?
>
> Practical benchmarks showcase the highest I/O performance for each TVM spec,
> and even a few (4) iperf workers cause contention here.
> My gut feeling is yes, a real workload would probably hit this bottleneck.

Just curious -- what is the NIC in the TDX VM? I'm most familiar with the
Hyper-V case, where the NIC is the Hyper-V synthetic NIC. That driver
uses dedicated send and receive buffers that are allocated and decrypted
when the NIC is configured. Most NIC traffic goes through those buffers
instead of the swiotlb, so I probably haven't seen cases where the swiotlb
is the bottleneck for NIC traffic. I do see the swiotlb as the bottleneck for
disk I/O traffic, but the data copying tends to be the gate rather than the
allocation and freeing of swiotlb buffers.

>
> > Another approach to the contention problem would be to have a separate
> > CONFIG option that is narrower than CONFIG_DEBUG_FS, so that the
> > computation of the hiwater mark can be dropped entirely in production
> > environments. Or the setting could be dynamic at runtime via a
> > static_call, defaulting to not computing the hiwater mark while still
> > allowing a sysadmin to turn it on to see workload usage of the swiotlb.
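
A rough userspace sketch of the runtime-toggle idea, under loud
assumptions: static_call is a kernel facility, so a plain function
pointer stands in for it here, and all names (hiwater_update,
set_hiwater_tracking, model_map, and so on) are hypothetical. The model
is single-threaded and does not reproduce the atomic contention itself.
It only shows the shape: hiwater tracking is off by default and can be
switched on when someone wants to observe usage.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned long total_used;	/* KiB currently mapped (model only) */
static unsigned long total_hiwater;	/* only updated while tracking is on */

/* Default target: do nothing, so the map path pays no tracking cost. */
static void hiwater_noop(unsigned long used)
{
	(void)used;
}

/* Optional target: record the largest 'used' value seen so far. */
static void hiwater_track(unsigned long used)
{
	if (used > total_hiwater)
		total_hiwater = used;
}

/* Stand-in for a static_call target; defaults to no tracking. */
static void (*hiwater_update)(unsigned long) = hiwater_noop;

/* What a sysadmin-facing knob might do (hypothetical interface). */
static void set_hiwater_tracking(bool enable)
{
	hiwater_update = enable ? hiwater_track : hiwater_noop;
}

static void model_map(unsigned long kib)
{
	total_used += kib;
	hiwater_update(total_used);
}

static void model_unmap(unsigned long kib)
{
	assert(total_used >= kib);
	total_used -= kib;
}
```

With tracking left at its default, model_map() never touches
total_hiwater. After set_hiwater_tracking(true), subsequent mappings
update it, and any spike that happened while tracking was off is simply
not recorded.
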
>
> That's counter-intuitive from my perspective.
> With global counters, the observation itself impacts performance, so it
> can't capture the workload's practical characteristics: what gets measured
> is commonly lower than max performance, which in turn defeats the purpose
> of the measurement.

Agreed. If the global counters affect the performance and throughput
significantly, having an accurate hiwater mark loses some of its value.

>
> Even without those global counters, if a user wants to know the hiwater value,
> periodically snapshotting the used value (the sum over all areas, as in the
> current behavior) would produce a meaningful value for workload evaluation.

I'm a little skeptical of the value of just summing current usage. Doing so
tends to miss any spikes, and the spikes are the problem. If swiotlb capacity
is exceeded even for a short spike, you don't just get a performance blip.
You get I/O failures, which at least on the disk side tend to be fatal to the
application doing the I/O. Maybe the networking stack recovers well enough
and retries, resulting in just a performance reduction. But I've always thought
of swiotlb exhaustion as a fairly serious problem to be avoided at all costs.
That's why CoCo VMs allocate so much swiotlb space, even though most of
it is never used for typical workloads (at least in my experience).
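
As a small illustration of why periodic snapshots can miss spikes, here
is a userspace sketch over a made-up usage trace. The function names and
the trace values are hypothetical, not measurements. It compares the
peak seen when every update is tracked with the peak seen when usage is
only read every 'period' steps.

```c
#include <assert.h>
#include <stddef.h>

/* Peak when every update is observed (a true high-water mark). */
static unsigned long true_peak(const unsigned long *used, size_t n)
{
	unsigned long peak = 0;

	for (size_t i = 0; i < n; i++)
		if (used[i] > peak)
			peak = used[i];
	return peak;
}

/* Peak when usage is only read at steps 0, period, 2 * period, ... */
static unsigned long sampled_peak(const unsigned long *used, size_t n,
				  size_t period)
{
	unsigned long peak = 0;

	assert(period > 0);
	for (size_t i = 0; i < n; i += period)
		if (used[i] > peak)
			peak = used[i];
	return peak;
}
```

For a trace in KiB of { 2048, 2048, 2048, 40960, 2048, 2048, 2048, 2048 },
true_peak() returns 40960, but sampled_peak() with a period of 4 reads
only steps 0 and 4 and returns 2048, so the 40 MiB spike is invisible to
the sampler.
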

Michael