Re: [PATCH v4 2/2] arm64: io: apply the device store-release workaround once per block write

From: Shanker Donthineni

Date: Mon Jun 29 2026 - 19:09:48 EST

Hi Vladimir,

On 6/29/2026 5:48 AM, Vladimir Murzin wrote:

External email: Use caution opening links or attachments

Hi,

On 6/25/26 19:24, Shanker Donthineni wrote:

The generic memset_io()/memcpy_toio() are built on __raw_write*(), so on
parts with the NVIDIA Olympus device store/load ordering erratum the
ARM64_WORKAROUND_DEVICE_STORE_RELEASE workaround promotes every store in
the block to a store-release. Each stlr* carries a barrier cost, so block
MMIO becomes O(n) store-releases, making a block copy many times slower
than a single ordered burst and growing with the transfer size.

Provide arm64 memset_io()/memcpy_toio() that emit plain str* in the loop
and order the whole block against subsequent loads with a single
trailing dmb osh on affected CPUs (a no-op elsewhere, preserving the
relaxed contract of these helpers). This keeps block MMIO writes at
one-barrier cost rather than scaling with the transfer size.

Performance (NVIDIA Olympus, write-combining MMIO to a device BAR, single
PE pinned; per-call cost in ns; consecutive writes ping-pong between two
buffers so repeated stores are not coalesced; iowrite64/iowrite32 =
__iowrite{64,32}_copy()):

Table 1 - arm64 memset_io/memcpy_toio (this patch)
+-------+-----------+-----------+-----------+-------------+
| size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-----------+-------------+
| 8B | 231.6 ns | 231.6 ns | 232.4 ns | 232.4 ns |
| 16B | 231.7 ns | 231.9 ns | 232.7 ns | 232.6 ns |
| 32B | 231.9 ns | 232.7 ns | 232.9 ns | 232.9 ns |
| 64B | 232.7 ns | 235.0 ns | 233.7 ns | 233.6 ns |
| 128B | 233.6 ns | 235.8 ns | 234.4 ns | 234.3 ns |
| 256B | 237.7 ns | 276.8 ns | 264.0 ns | 276.7 ns |
| 512B | 237.7 ns | 277.1 ns | 238.1 ns | 277.6 ns |
| 1KB | 253.7 ns | 279.3 ns | 276.1 ns | 294.1 ns |
| 2KB | 295.0 ns | 318.7 ns | 288.5 ns | 308.3 ns |
| 4KB | 365.9 ns | 381.4 ns | 365.7 ns | 381.3 ns |
+-------+-----------+-----------+-----------+-------------+
all four helpers end with a single trailing barrier (dmb osh).

Table 2 - generic per-store memset_io/memcpy_toio
+-------+-----------+-----------+-------------+--------------+
| size | iowrite64 | iowrite32 | memset_io | memcpy_toio |
+-------+-----------+-----------+-------------+--------------+
| 8B | 231.6 ns | 231.6 ns | 229.0 ns | 229.0 ns |
| 16B | 231.7 ns | 231.9 ns | 458.4 ns | 458.5 ns |
| 32B | 231.9 ns | 232.7 ns | 917.4 ns | 917.5 ns |
| 64B | 232.7 ns | 234.8 ns | 1835.4 ns | 1835.5 ns |
| 128B | 233.6 ns | 235.8 ns | 3670.9 ns | 3670.8 ns |
| 256B | 237.7 ns | 276.7 ns | 7341.6 ns | 7341.6 ns |
| 512B | 237.7 ns | 279.4 ns | 14001.4 ns | 14001.3 ns |
| 1KB | 253.7 ns | 279.1 ns | 28631.5 ns | 28631.8 ns |
| 2KB | 279.4 ns | 317.9 ns | 57276.3 ns | 57275.2 ns |
| 4KB | 365.7 ns | 381.5 ns | 114564.4 ns | 114563.6 ns |
+-------+-----------+-----------+-------------+--------------+
the generic memset_io()/memcpy_toio() build on __raw_write*(), which the
workaround promotes to store-release, so every store is individually
ordered - hence O(n) in the store count.

The arm64 versions stay flat at one-barrier cost while the generic
per-store writers collapse to O(n): at 4KB ~314x slower (~115 us vs
~366 ns).

Signed-off-by: Shanker Donthineni <sdonthineni@xxxxxxxxxx>
---
arch/arm64/include/asm/io.h | 5 +++
arch/arm64/kernel/io.c | 82 +++++++++++++++++++++++++++++++++++++
2 files changed, 87 insertions(+)

diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 69e0fa004d31..649503f347bc 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -266,6 +266,11 @@ __iowrite64_copy(void __iomem *to, const void *from, size_t count)
}
#define __iowrite64_copy __iowrite64_copy

+void memset_io(volatile void __iomem *dst, int c, size_t count);
+#define memset_io memset_io
+void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count);
+#define memcpy_toio memcpy_toio
+
/*
* I/O memory mapping functions.
*/
diff --git a/arch/arm64/kernel/io.c b/arch/arm64/kernel/io.c
index fe86ada23c7d..b5fd9ee6d9eb 100644
--- a/arch/arm64/kernel/io.c
+++ b/arch/arm64/kernel/io.c
@@ -5,9 +5,91 @@
* Copyright (C) 2012 ARM Ltd.
*/

+#include <linux/align.h>
#include <linux/export.h>
#include <linux/types.h>
#include <linux/io.h>
+#include <linux/unaligned.h>
+
+#include <asm/alternative.h>
+
+/*
+ * ARM64_WORKAROUND_DEVICE_STORE_RELEASE promotes every raw MMIO store
+ * (__raw_write*()) to a store-release on affected CPUs. The generic
+ * memset_io()/memcpy_toio() are built on those helpers, so the workaround would
+ * emit one store-release per element and turn a block write into O(n) ordered
+ * stores - far more costly than the single barrier a block actually needs.
+ *
+ * Provide arm64 versions that emit plain STR in the loop and order the whole
+ * block against subsequent loads with one trailing DMB OSH, patched in only on
+ * affected CPUs (a no-op elsewhere, so the relaxed contract of these helpers is
+ * preserved).
+ *
+ * This capability is currently enabled only for the NVIDIA Olympus device
+ * store/load ordering erratum, where a Device-nGnR* load may be observed before
+ * an older, non-overlapping Device-nGnR* store to the same peripheral.
+ */
+static __always_inline void iomem_block_store_barrier(void)
+{
+ asm volatile(ALTERNATIVE("nop", "dmb osh",
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+ : : : "memory");
+}
+
+void memset_io(volatile void __iomem *dst, int c, size_t count)
+{
+ u64 qc = (u8)c;
+
+ qc *= ~0ULL / 0xff;
+
+ while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) {
+ asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory");
+ dst++;
+ count--;
+ }
+ while (count >= sizeof(u64)) {
+ asm volatile("str %x0, [%1]" : : "rZ"(qc), "r"(dst) : "memory");
+ dst += sizeof(u64);
+ count -= sizeof(u64);
+ }
+ while (count) {
+ asm volatile("strb %w0, [%1]" : : "rZ"((u8)c), "r"(dst) : "memory");
+ dst++;
+ count--;
+ }
+
+ iomem_block_store_barrier();
+}
+EXPORT_SYMBOL(memset_io);
+
+void memcpy_toio(volatile void __iomem *dst, const void *src, size_t count)
+{
+ while (count && !IS_ALIGNED((__force unsigned long)dst, sizeof(u64))) {
+ asm volatile("strb %w0, [%1]"
+ : : "rZ"(*(const u8 *)src), "r"(dst) : "memory");
+ src++;
+ dst++;
+ count--;
+ }
+ while (count >= sizeof(u64)) {
+ asm volatile("str %x0, [%1]"
+ : : "rZ"(get_unaligned((const u64 *)src)), "r"(dst)

Why do we need get_unaligned() here? I understand this came from
the generic implementation, where it needs to handle architectures
that do not support unaligned accesses. But IIUC this is not an
issue for arm64, and there was no special handling in memcpy_toio()
before 0110feaaf6d0 ("arm64: Use new fallback IO memcpy/memset").
Am I missing something?

Thanks for the review.

I used get_unaligned() because I was trying to keep the arm64 implementation
as close as possible to the generic memcpy_toio() implementation in
lib/iomem_copy.c. However, you are right that before commit 0110feaaf6d0
(“arm64: Use new fallback IO memcpy/memset”), the arm64 implementation
used a direct u64 load and did not explicitly handle source alignment. I
can restore the previous arm64 form in v5 if that is preferred.

+ : "memory");
+ src += sizeof(u64);
+ dst += sizeof(u64);
+ count -= sizeof(u64);
+ }
+ while (count) {
+ asm volatile("strb %w0, [%1]"
+ : : "rZ"(*(const u8 *)src), "r"(dst) : "memory");
+ src++;
+ dst++;
+ count--;
+ }
+
+ iomem_block_store_barrier();

It is perhaps a matter of taste, but having the inline assembly
here (and in memset_io()) might make the code clearer. To a
casual reader, it would be obvious that the barrier is not
guaranteed and is only applicable to ARM64_WORKAROUND_DEVICE_STORE_RELEASE,
without having to jump back and forth through the code.

Obliviously maintainers might have different preference ;)

Regarding the barrier, iomem_block_store_barrier() is declared
static __always_inline, so it does not add a function call. The nop/dmb
osh alternative is emitted directly in each caller. I used the helper to
avoid duplicating the alternative sequence.

I understand that placing the assembly directly in both functions could
make its conditional nature more obvious. I do not have a strong preference
and am happy to follow Will’s and Catalin’s preference here.

-Shanker