RE: [PATCH] swiotlb: eliminate per-map atomic contention on used/hiwater tracking

From: Du, Fan

Date: Thu Jun 25 2026 - 23:12:34 EST

> -----Original Message-----
> From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> Sent: Thursday, June 25, 2026 11:54 PM
> To: Du, Fan <fan.du@xxxxxxxxx>; Michael Kelley <mhklinux@xxxxxxxxxxx>;
> Miao, Jun <jun.miao@xxxxxxxxx>; m.szyprowski@xxxxxxxxxxx;
> robin.murphy@xxxxxxx
> Cc: iommu@xxxxxxxxxxxxxxx; chenhgs@xxxxxxxxxxxxxxx;
> LKML <linux-kernel@xxxxxxxxxxxxxxx>
> Subject: RE: [PATCH] swiotlb: eliminate per-map atomic contention on
> used/hiwater tracking
>
> From: Du, Fan <fan.du@xxxxxxxxx> Sent: Thursday, June 25, 2026 12:30 AM
> >
> > > -----Original Message-----
> > > From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> > > Sent: Tuesday, June 23, 2026 10:35 AM
> > > To: Miao, Jun <jun.miao@xxxxxxxxx>; m.szyprowski@xxxxxxxxxxx;
> > > robin.murphy@xxxxxxx
> > > Cc: iommu@xxxxxxxxxxxxxxx; chenhgs@xxxxxxxxxxxxxxx; Du, Fan
> > > <fan.du@xxxxxxxxx>; LKML <linux-kernel@xxxxxxxxxxxxxxx>
> > > Subject: RE: [PATCH] swiotlb: eliminate per-map atomic contention on
> > > used/hiwater tracking
> > >
>
> [snip]
>
> > > > + pool = &mem->defpool;
> > > > + for (i = 0; i < pool->nareas; i++)
> > > > + hiwater += READ_ONCE(pool->areas[i].used_hiwater);
> > >
> > > Let's ignore the SWIOTLB_DYNAMIC case for simplicity. The approach
> > > of calculating a separate hiwater mark for each area, and then summing
> > > those per-area hiwater marks, can produce very wrong results.
> > >
> > > Consider a 64MiB swiotlb in a system with 8 CPUs. There will be 8 areas,
> > > each with 8 MiB of space. Suppose the workload putters along with mostly
> > > smallish I/Os, say between 4 KiB and 32 KiB. If each area has 16 I/Os in
> > > progress, the area hiwater mark might be 256 KiB (16 I/Os averaging 16 KiB
> > > each). Summing across areas produces a hiwater mark of 8 * 256 KiB = 2 MiB.
> > > But then suppose a 2 MiB I/O comes in. The hiwater mark for the area that
> > > handles that I/O will grow to 2+ MiB. After the first big I/O finishes,
> > > another 2 MiB I/O comes in that is handled by a different area, whose
> > > hiwater mark also goes to 2+ MiB. Pretty soon all 8 areas have a hiwater
> > > mark of 2+ MiB, and the total hiwater mark is reported as 16+ MiB. The
> > > old algorithm would have reported 4+ MiB, which is accurate. With
> > > higher CPU counts and more areas, the discrepancy can get much worse.
> > > This is a somewhat contrived example, but the problem is real enough
> > > to make the reported hiwater mark be unreliable.
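
To make the arithmetic in the example above concrete, here is a minimal
userspace sketch (not kernel code). It assumes 8 areas, a steady load of
256 KiB per area, and one 2 MiB I/O that visits each area in turn. The names
struct pool, pool_map(), pool_unmap(), summed_hiwater() and run_scenario()
are hypothetical; they only mirror the used/used_hiwater idea from the patch.
All sizes are in KiB.

```c
#include <assert.h>
#include <stddef.h>

#define NAREAS 8

struct area {
	unsigned long used;          /* KiB currently mapped in this area */
	unsigned long used_hiwater;  /* highest value 'used' ever reached */
};

struct pool {
	struct area areas[NAREAS];
	unsigned long total_used;     /* global counter (old algorithm) */
	unsigned long total_hiwater;  /* global high-water mark */
};

/* Account for a mapping of 'kib' KiB in area 'a', updating both marks. */
void pool_map(struct pool *p, int a, unsigned long kib)
{
	assert(a >= 0 && a < NAREAS);
	p->areas[a].used += kib;
	if (p->areas[a].used > p->areas[a].used_hiwater)
		p->areas[a].used_hiwater = p->areas[a].used;
	p->total_used += kib;
	if (p->total_used > p->total_hiwater)
		p->total_hiwater = p->total_used;
}

/* Release 'kib' KiB from area 'a'; high-water marks never decrease. */
void pool_unmap(struct pool *p, int a, unsigned long kib)
{
	assert(a >= 0 && a < NAREAS);
	p->areas[a].used -= kib;
	p->total_used -= kib;
}

/* Sum of per-area high-water marks, as the quoted hunk computes it. */
unsigned long summed_hiwater(const struct pool *p)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < NAREAS; i++)
		sum += p->areas[i].used_hiwater;
	return sum;
}

/* Steady 256 KiB per area, then one 2 MiB I/O handled by each area in turn. */
void run_scenario(struct pool *p)
{
	for (int a = 0; a < NAREAS; a++)
		pool_map(p, a, 256);
	for (int a = 0; a < NAREAS; a++) {
		pool_map(p, a, 2048);
		pool_unmap(p, a, 2048);
	}
}
```

Calling run_scenario() on a zero-initialized struct pool leaves
total_hiwater at 4096 KiB (4 MiB), while summed_hiwater() returns 18432 KiB
(18 MiB), which is in line with the 4+ MiB vs. 16+ MiB figures above.
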
> >
> > You are correct here, I like your way of thinking.
> >
> > > I'm sure the contention for the total hiwater mark in the current code is
> > > real, but it's in the context of a lot of other CPU work that is being done
> > > because of the bounce buffering, including copying lots of data to/from the
> > > bounce buffers. Is that atomic increment operation a bottleneck even in the
> > > end-to-end context of doing DMA through swiotlb bounce buffers?
> >
> > Practical benchmarks showcase the highest I/O performance for each TVM spec,
> > and even a few (4) iperf workers cause the contention here.
> > My gut feeling is yes, real workloads probably hit the bottleneck here.
>
> Just curious -- what is the NIC in the TDX VM? I'm most familiar with the
> Hyper-V case, where the NIC is the Hyper-V synthetic NIC. That driver
> uses dedicated send and receive buffers that are allocated and decrypted
> when the NIC is configured. Most NIC traffic goes through those buffers
> instead of the swiotlb, so I probably haven't seen cases where the swiotlb
> is the bottleneck for NIC traffic. I do see the swiotlb as the bottleneck for
> disk I/O traffic, but the data copying tends to be the gate rather than the
> allocation and freeing of swiotlb buffers.
>
> >
> > > Another approach to the contention problem would be to have a separate
> > > CONFIG option that is narrower than CONFIG_DEBUG_FS, so that the
> > > computation of the hiwater mark can be dropped entirely in production
> > > environments. Or the setting could be dynamic at runtime via a
> > > static_call, defaulting to not computing the hiwater mark while still
> > > allowing a sysadmin to turn it on to see workload usage of the swiotlb.
> >
> > That's counter-intuitive from my perspective.
> > With global counters, the observation itself impacts performance, so it
> > can't capture the workload's practical characteristics (the measured
> > performance is commonly lower than the maximum), which in turn breaks
> > the semantics of what it's for.
>
> Agreed. If the global counters affect the performance and throughput
> significantly, having an accurate hiwater mark loses some of its value.
>
> >
> > Even without those global counters, if the user wants to know the hiwater
> > value, periodically snapshotting the used value (the sum over all areas,
> > as in the current behavior) would produce a meaningful value for workload
> > evaluation.
>
> I'm a little skeptical of the value of just summing current usage. Doing so
> tends to miss any spikes, and the spikes are the problem. If swiotlb capacity

That's the current design when CONFIG_DEBUG_FS is off, and it is used as a
swiotlb shortage indicator for the user.

Statistically, that sampled value approximates the true value, as with any
sampling.
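
As a rough illustration of how the sampling period interacts with spike
duration, here is a small userspace sketch. The helper names true_peak() and
sampled_peak() are made up for illustration and are not taken from the swiotlb
code. The trace holds the used value in KiB, one entry per time step.

```c
#include <stddef.h>

/* Largest value anywhere in the trace: what a per-map hiwater mark records. */
unsigned long true_peak(const unsigned long *trace, size_t n)
{
	unsigned long peak = 0;

	for (size_t i = 0; i < n; i++)
		if (trace[i] > peak)
			peak = trace[i];
	return peak;
}

/* Largest value seen when the trace is read only every 'period' steps. */
unsigned long sampled_peak(const unsigned long *trace, size_t n, size_t period)
{
	unsigned long peak = 0;

	if (period == 0)
		period = 1;  /* treat 0 as "sample every step" */
	for (size_t i = 0; i < n; i += period)
		if (trace[i] > peak)
			peak = trace[i];
	return peak;
}
```

With a hypothetical trace that sits at 2048 KiB and spikes to 4096 KiB for a
single step, sampling every step sees the spike, while a coarser period only
sees it if a sample happens to land on that step.
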

> is exceeded even for a short spike, you don't just get a performance blip.
> You get I/O failures, which at least on the disk side tends to be fatal to the
> application doing the I/O. Maybe the networking stack recovers well enough
> and retries, resulting in just a performance reduction. But I've always thought
> of swiotlb exhaustion as a fairly serious problem to be avoided at all costs.
> That's why CoCo VMs allocate so much swiotlb space, even though most of
> it is never used for typical workloads (at least in my experience).

That's what dynamic SWIOTLB is designed for.
Exhaustion then happens only when the I/O is so intensive that the transient
buffer/DMA pool is used up before a new shared memory pool is created.

> Michael