Re: [PATCH v4 0/2] arm64: errata: NVIDIA Olympus device store/load ordering

From: Shanker Donthineni

Date: Mon Jun 29 2026 - 19:09:03 EST

Hi Vladimir,

On 6/29/2026 5:45 AM, Vladimir Murzin wrote:

External email: Use caution opening links or attachments

Hi,

On 6/25/26 19:24, Shanker Donthineni wrote:

This series works around the NVIDIA Olympus device store/load ordering
erratum (T410-OLY-1027): a Device-nGnR* load can be observed by a
peripheral before an older, non-overlapping Device-nGnR* store to the
same peripheral, breaking the program order that drivers rely on for
MMIO and potentially leaving a device in an incorrect state.

Patch 1 adds the workaround. It promotes the raw MMIO store helpers
(__raw_writeb/w/l/q, and therefore writel()/writel_relaxed()) to
store-release on affected CPUs, and promotes the trailing DGH of the
write-combining __iowrite{32,64}_copy() helpers to dmb osh. Everything is
gated on a new ARM64_WORKAROUND_DEVICE_STORE_RELEASE cpucap and patched
in only on affected parts, so it is a no-op elsewhere.

Patch 2 provides arm64 memset_io()/memcpy_toio(). The generic versions
are built on __raw_write*(), so patch 1 would promote every store in a
block to a store-release; as each STLR drains the write-combining buffer,
block MMIO becomes O(n) store-releases. The arm64 versions emit plain
STR in the loop and order the whole block with a single trailing dmb osh,
keeping block MMIO at one-barrier cost.

Performance: NVIDIA Olympus, write-combining MMIO to a device BAR, single
PE pinned; per-call cost in ns. Consecutive writes ping-pong between two
buffers so repeated stores are not coalesced. iowrite64/iowrite32 =
__iowrite{64,32}_copy().

Table 1 - workaround off (CONFIG_NVIDIA_OLYMPUS_1027_ERRATUM=n)
+-------+-----------+-----------+-----------+-------------+
| size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-----------+-------------+
| 8B | 67.9 ns | 67.8 ns | 3.6 ns | 3.6 ns |
| 16B | 67.9 ns | 67.8 ns | 4.0 ns | 4.0 ns |
| 32B | 67.9 ns | 67.9 ns | 4.6 ns | 4.6 ns |
| 64B | 69.1 ns | 69.1 ns | 69.1 ns | 69.0 ns |
| 128B | 138.3 ns | 138.3 ns | 138.4 ns | 138.3 ns |
| 256B | 276.6 ns | 276.6 ns | 276.6 ns | 276.7 ns |
| 512B | 276.6 ns | 276.5 ns | 276.6 ns | 276.6 ns |
| 1KB | 276.6 ns | 278.4 ns | 276.6 ns | 276.6 ns |
| 2KB | 278.4 ns | 278.4 ns | 275.9 ns | 276.6 ns |
| 4KB | 365.7 ns | 365.7 ns | 365.7 ns | 365.7 ns |
+-------+-----------+-----------+-----------+-------------+
relaxed/no-flush: memset_io()/memcpy_toio() issue plain stores with no
trailing dgh() or barrier, unlike __iowrite*_copy() which ends with dgh().

Table 2 - workaround on, arm64 memset_io/memcpy_toio (this series)
+-------+-----------+-----------+-----------+-------------+
| size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-----------+-------------+
| 8B | 231.6 ns | 231.6 ns | 232.4 ns | 232.4 ns |
| 16B | 231.7 ns | 231.9 ns | 232.7 ns | 232.6 ns |
| 32B | 231.9 ns | 232.7 ns | 232.9 ns | 232.9 ns |
| 64B | 232.7 ns | 235.0 ns | 233.7 ns | 233.6 ns |
| 128B | 233.6 ns | 235.8 ns | 234.4 ns | 234.3 ns |
| 256B | 237.7 ns | 276.8 ns | 264.0 ns | 276.7 ns |
| 512B | 237.7 ns | 277.1 ns | 238.1 ns | 277.6 ns |
| 1KB | 253.7 ns | 279.3 ns | 276.1 ns | 294.1 ns |
| 2KB | 295.0 ns | 318.7 ns | 288.5 ns | 308.3 ns |
| 4KB | 365.9 ns | 381.4 ns | 365.7 ns | 381.3 ns |
+-------+-----------+-----------+-----------+-------------+
all four helpers end with a single trailing barrier (dmb osh).

Table 3 - workaround on, generic per-store memset_io/memcpy_toio
+-------+-----------+-----------+-------------+--------------+
| size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-------------+--------------+
| 8B | 231.6 ns | 231.6 ns | 229.0 ns | 229.0 ns |
| 16B | 231.7 ns | 231.9 ns | 458.4 ns | 458.5 ns |
| 32B | 231.9 ns | 232.7 ns | 917.4 ns | 917.5 ns |
| 64B | 232.7 ns | 234.8 ns | 1835.4 ns | 1835.5 ns |
| 128B | 233.6 ns | 235.8 ns | 3670.9 ns | 3670.8 ns |
| 256B | 237.7 ns | 276.7 ns | 7341.6 ns | 7341.6 ns |
| 512B | 237.7 ns | 279.4 ns | 14001.4 ns | 14001.3 ns |
| 1KB | 253.7 ns | 279.1 ns | 28631.5 ns | 28631.8 ns |
| 2KB | 279.4 ns | 317.9 ns | 57276.3 ns | 57275.2 ns |
| 4KB | 365.7 ns | 381.5 ns | 114564.4 ns | 114563.6 ns |
+-------+-----------+-----------+-------------+--------------+
the generic memset_io()/memcpy_toio() build on __raw_write*(), which the
workaround promotes to store-release, so every store is individually
ordered - hence O(n) in the store count.

Tables 2 and 3 show why patch 2 is needed: the generic per-store block
writers collapse to O(n) under the workaround (4KB ~314x slower, ~115 us
vs ~366 ns), while the arm64 versions stay flat at one-barrier cost.

That's interesting. With the way the patch set is structured, it
now looks like:

1. Fix the erratum, but cause a performance regression.
2. Restore the performance regression and (re)apply the erratum
workaround.

Would it make sense to avoid introducing the performance
regression in the first place by structuring the patch set
slightly differently?

1. (Re)introduce arm64 memset_io()/memcpy_toio().
2. Fix the erratum once for all

What do you reckon?

Yes, that ordering makes sense.

I can restructure v5 so that patch 1 introduces the arm64 memset_{to}io()
implementations while preserving the existing behavior. Patch 2 will
then add the complete erratum workaround, including the conditional
trailing DMB for those block-write helpers. This avoids introducing
the intermediate performance regression and keeps each commit
independently usable.

Will and Catalin, could you please share your thoughts on this approach?

-Shanker