Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum

From: Shanker Donthineni

Date: Thu Jun 11 2026 - 21:14:17 EST

Hi Will,

On 6/11/2026 8:39 AM, sdonthineni@xxxxxxxxxx wrote:

-----Original Message-----
From: Will Deacon <will@xxxxxxxxxx>
Sent: Thursday, June 11, 2026 8:34 AM
To: Shanker Donthineni <sdonthineni@xxxxxxxxxx>
Cc: Catalin Marinas <catalin.marinas@xxxxxxx>; Vladimir Murzin <vladimir.murzin@xxxxxxx>; Jason Gunthorpe <jgg@xxxxxxxxxx>; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; Mark Rutland <mark.rutland@xxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-doc@xxxxxxxxxxxxxxx; Vikram Sethi <vsethi@xxxxxxxxxx>; Jason Sequeira <jsequeira@xxxxxxxxxx>
Subject: Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum

On Wed, Jun 10, 2026 at 11:48:22AM -0500, Shanker Donthineni wrote:

On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
observed by a peripheral before an older, non-overlapping Device-nGnR*
store to the same peripheral. This breaks the program-order guarantee
that software expects for Device-nGnR* accesses and can leave a
peripheral in an incorrect state, as a load is observed before an
earlier store takes effect.

The erratum can occur only when all of the following apply:

- A PE executes a Device-nGnR* store followed by a younger
Device-nGnR* load.
- The store is not a store-release.
- The accesses target the same peripheral and do not overlap in bytes.
- There is at most one intervening Device-nGnR* store in program
order, and there are no intervening Device-nGnR* loads.
- There is no DSB, and no DMB that orders loads, between the store and
the load.
- Specific micro-architectural and timing conditions occur.

Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain
str* to stlr* (Store-Release), which removes the "store is not a
store-release" condition for every device write the kernel issues.
Because writel() and writel_relaxed() are both built on __raw_writel()
in asm-generic/io.h, patching the raw variants covers both the
non-relaxed and relaxed APIs without touching the higher layers. Note
that writel()'s own barrier sits before the store, so it does not
order the store against a subsequent readl(); the store-release
promotion is what provides that ordering.
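
To make the layering concrete, here is a minimal user-space sketch of how a
non-relaxed and a relaxed write can share one raw helper. It is a model under
stated assumptions, not the kernel's code: model_writel(),
model_writel_relaxed(), model_raw_writel(), model_readl() and the event log are
hypothetical names, and barriers are only recorded, not executed. The point is
that the barrier is logged before the store, so nothing sits between the store
and a following read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event kinds for a user-space model; not kernel definitions. */
enum model_event { EV_BARRIER, EV_STORE, EV_LOAD };

#define MODEL_LOG_MAX 16

struct model_log {
	enum model_event ev[MODEL_LOG_MAX];
	size_t n;
};

static inline void model_log_add(struct model_log *log, enum model_event e)
{
	assert(log->n < MODEL_LOG_MAX);
	log->ev[log->n++] = e;
}

/* Stands in for __raw_writel(): the single place where the store is issued. */
static inline void model_raw_writel(struct model_log *log, uint32_t *reg,
				    uint32_t val)
{
	*reg = val;
	model_log_add(log, EV_STORE);
}

/* Relaxed write: the raw store and nothing else. */
static inline void model_writel_relaxed(struct model_log *log, uint32_t *reg,
					uint32_t val)
{
	model_raw_writel(log, reg, val);
}

/* Non-relaxed write: barrier first, then the same raw store. */
static inline void model_writel(struct model_log *log, uint32_t *reg,
				uint32_t val)
{
	model_log_add(log, EV_BARRIER);
	model_writel_relaxed(log, reg, val);
}

/* Read: only the load itself is modelled here. */
static inline uint32_t model_readl(struct model_log *log, const uint32_t *reg)
{
	model_log_add(log, EV_LOAD);
	return *reg;
}
```

Calling model_writel() and then model_readl() on the same register logs
barrier, store, load in that order. Both write flavours funnel through
model_raw_writel(), which is why changing only the raw helper reaches both.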

Like ARM64_ERRATUM_832075 on the load side, the change is gated on a
new ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only
activated on parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs
continue to use the plain str* sequence.

Note: stlr* only supports base-register addressing, so affected CPUs
use a base-register stlr* path. Unaffected CPUs keep the original
offset-addressed str* sequence introduced by commit d044d6ba6f02
("arm64: io: permit offset addressing").

The __const_memcpy_toio_aligned32() and
__const_memcpy_toio_aligned64() helpers are left unchanged. These
helpers are intended for write-combining mappings, which are Normal-NC
on arm64. Replacing their contiguous str* groups would defeat the
write-combining behavior used to improve store performance.

Co-developed-by: Vikram Sethi <vsethi@xxxxxxxxxx>
Signed-off-by: Vikram Sethi <vsethi@xxxxxxxxxx>
Signed-off-by: Shanker Donthineni <sdonthineni@xxxxxxxxxx>
Reviewed-by: Catalin Marinas <catalin.marinas@xxxxxxx>
---
Changes since v2:
- Reworked the raw MMIO write helpers so unaffected CPUs keep the
existing offset-addressed STR sequence, while affected CPUs use the
base-register STLR path.
- Updated the commit message to match the code changes.
- Rebased on top of the arm64 for-next/errata branch:
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata

Changes since v1:
- Updated the commit message based on feedback from Vladimir Murzin.

Documentation/arch/arm64/silicon-errata.rst | 2 ++
arch/arm64/Kconfig | 23 ++++++++++++++++
arch/arm64/include/asm/io.h | 30 +++++++++++++++++++++
arch/arm64/kernel/cpu_errata.c | 8 ++++++
arch/arm64/tools/cpucaps | 1 +
5 files changed, 64 insertions(+)

diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index ad09bbb10da80..fc45125dc2f80 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -298,6 +298,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | Carmel Core | N/A | NVIDIA_CARMEL_CNP_ERRATUM |
+----------------+-----------------+-----------------+-----------------------------+
+| NVIDIA | Olympus core | T410-OLY-1027 | NVIDIA_OLYMPUS_1027_ERRATUM |
++----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | Olympus core | T410-OLY-1029 | ARM64_ERRATUM_4118414 |
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | T241 GICv3/4.x | T241-FABRIC-4 | N/A |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c65cef81be86a..d633eb70de1ac 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075

If unsure, say Y.

+config NVIDIA_OLYMPUS_1027_ERRATUM
+ bool "NVIDIA Olympus: device store/load ordering erratum"
+ default y
+ help
+ This option adds an alternative code sequence to work around an
+ NVIDIA Olympus core erratum where a Device-nGnR* store can be
+ observed by a peripheral after a younger Device-nGnR* load to the
+ same peripheral. This breaks the program order that drivers rely
+ on for MMIO and can leave a device in an incorrect state.
+
+ The workaround promotes the raw MMIO store helpers
+ (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
+ required ordering. Because writel() and writel_relaxed() are built
+ on __raw_writel(), both are covered without changes to the higher
+ layers.
+
+ The fix is applied through the alternatives framework, so enabling
+ this option does not by itself activate the workaround: it is
+ patched in only when an affected CPU is detected, and is a no-op on
+ unaffected CPUs.
+
+ If unsure, say Y.
+
config ARM64_ERRATUM_834220
bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
depends on KVM
diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
index 8cbd1e96fd50b..801223e754c90 100644
--- a/arch/arm64/include/asm/io.h
+++ b/arch/arm64/include/asm/io.h
@@ -22,10 +22,22 @@
/*
* Generic IO read/write. These perform native-endian accesses.
*/
+static __always_inline bool arm64_needs_device_store_release(void)
+{
+ return alternative_has_cap_unlikely(
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
+}
+
#define __raw_writeb __raw_writeb
static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
{
volatile u8 __iomem *ptr = addr;
+
+ if (arm64_needs_device_store_release()) {
+ asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
+ return;
+ }
+
asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
}

Use an 'else' clause instead of the early return? (similarly for the other changes).

I still reckon you should do something with the memcpy-to-io routines.
A simple option could be to make dgh() a dmb on parts with the erratum?
That at least moves the barrier out of the loop.

Thanks Will. I looked again at both the arm64 comments and the generic iomap_copy.c
contract, and I’m not convinced that making dgh() a dmb is the right fit for this
path. Based on the documented comments, callers should not assume ordering from
these helpers; if ordering is required around a memcpy, the call site should already
be providing the necessary barriers.
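
As a rough illustration of that division of responsibility, the sketch below
models a call site that supplies its own barrier after a copy that promises no
ordering. Everything here is hypothetical and runs in user space:
model_iowrite32_copy() stands in for a copy helper with no ordering guarantee,
model_barrier() stands in for whatever barrier the driver actually needs, and
model_submit() is an invented call site. Barriers are only recorded in a trace,
not executed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trace of what the call site did, in program order. */
enum model_step { STEP_COPY, STEP_BARRIER, STEP_DOORBELL };

#define MODEL_TRACE_MAX 8

struct model_trace {
	enum model_step step[MODEL_TRACE_MAX];
	size_t n;
};

static inline void model_trace_add(struct model_trace *t, enum model_step s)
{
	assert(t->n < MODEL_TRACE_MAX);
	t->step[t->n++] = s;
}

/* Stand-in for a copy helper that gives no ordering guarantee of its own. */
static inline void model_iowrite32_copy(struct model_trace *t, uint32_t *to,
					const uint32_t *from, size_t count)
{
	for (size_t i = 0; i < count; i++)
		to[i] = from[i];
	model_trace_add(t, STEP_COPY);
}

/* Stand-in for the barrier the call site chooses; only recorded here. */
static inline void model_barrier(struct model_trace *t)
{
	model_trace_add(t, STEP_BARRIER);
}

/*
 * Invented call site: the caller, not the copy helper, orders the copy
 * against the doorbell write that follows it.
 */
static inline void model_submit(struct model_trace *t, uint32_t *window,
				uint32_t *doorbell, const uint32_t *desc,
				size_t count)
{
	model_iowrite32_copy(t, window, desc, count);
	model_barrier(t);
	*doorbell = 1;
	model_trace_add(t, STEP_DOORBELL);
}
```

The copy helper stays free of barriers, so a caller that needs no ordering pays
nothing, and one that does need it states so explicitly at the call site.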

Related data point in generic lib/iomap_copy.c:

/**
* __iowrite32_copy - copy data to MMIO space, in 32-bit units
* @to: destination, in MMIO space (must be 32-bit aligned)
* @from: source (must be 32-bit aligned)
* @count: number of 32-bit quantities to copy
*
* Copy data from kernel space to MMIO space, in units of 32 bits at a
* time. Order of access is not guaranteed, nor is a memory barrier
* performed afterwards.
*/
#ifndef __iowrite32_copy
void __iowrite32_copy(void __iomem *to, const void *from, size_t count)

/**
* __iowrite64_copy - copy data to MMIO space, in 64-bit or 32-bit units
* @to: destination, in MMIO space (must be 64-bit aligned)
* @from: source (must be 64-bit aligned)
* @count: number of 64-bit quantities to copy
*
* Copy data from kernel space to MMIO space, in units of 32 or 64 bits at a
* time. Order of access is not guaranteed, nor is a memory barrier
* performed afterwards.
*/
#ifndef __iowrite64_copy
void __iowrite64_copy(void __iomem *to, const void *from, size_t count)

The arm64 comment in arch/arm64/include/asm/io.h says:

/*
* The ARM64 iowrite implementation is intended to support drivers that want to
* use write combining. For instance PCI drivers using write combining with a 64
* byte __iowrite64_copy() expect to get a 64 byte MemWr TLP on the PCIe bus.
*
* Newer ARM core have sensitive write combining buffers, it is important that
* the stores be contiguous blocks of store instructions. Normal memcpy
* approaches have a very low chance to generate write combining.
*
* Since this is the only API on ARM64 that should be used with write combining
* it also integrates the DGH hint which is supposed to lower the latency to
* emit the large TLP from the CPU.
*/

So my reading is that dgh() in the arm64 implementation is there for the
write-combining/gathering behavior. Replacing it with dmb would make this
path stronger than the generic API contract and could penalize performance
of the WC use case.

For the scalar MMIO helpers, the workaround promotes the raw writes to
store-release on affected CPUs, as in v1/v2 and as shown below. For the
memcpy-to-IO helpers, could you please clarify the specific reason for adding a
dmb despite the documented no-ordering contract? Is the concern that some
drivers may be relying on ordering across memcpy_toio_*() today even though the
API does not guarantee it, and that we should cover those cases defensively?

I would prefer to avoid replacing dgh() with a dmb unless there is a strong
reason to do so. Please let me know if I can post v4 with the change below,
while keeping dgh() as-is in the memcpy-to-IO path.

#define __raw_writeb __raw_writeb
static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
{
- volatile u8 __iomem *ptr = addr;
- asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+ asm volatile(ALTERNATIVE("strb %w0, [%1]",
+ "stlrb %w0, [%1]",
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+ : : "rZ" (val), "r" (addr));
}

#define __raw_writew __raw_writew
static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
{
- volatile u16 __iomem *ptr = addr;
- asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+ asm volatile(ALTERNATIVE("strh %w0, [%1]",
+ "stlrh %w0, [%1]",
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+ : : "rZ" (val), "r" (addr));
}

#define __raw_writel __raw_writel
static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
{
- volatile u32 __iomem *ptr = addr;
- asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+ asm volatile(ALTERNATIVE("str %w0, [%1]",
+ "stlr %w0, [%1]",
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+ : : "rZ" (val), "r" (addr));
}

#define __raw_writeq __raw_writeq
static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
{
- volatile u64 __iomem *ptr = addr;
- asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
+ asm volatile(ALTERNATIVE("str %x0, [%1]",
+ "stlr %x0, [%1]",
+ ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+ : : "rZ" (val), "r" (addr));
}
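
For readers less familiar with the mechanism, here is a small user-space model
of what the four ALTERNATIVE() sites above amount to: each site carries a
default instruction and a replacement, and the replacement becomes live only
when the capability is set. This is a sketch under stated assumptions.
struct model_alt_site, model_init_sites() and model_apply_alternatives() are
hypothetical names, and the model tracks mnemonics as strings where the real
framework patches instructions in place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_NR_SITES 4	/* writeb, writew, writel, writeq */

/* One patch site: default mnemonic, replacement, and the one in effect. */
struct model_alt_site {
	const char *def;
	const char *repl;
	const char *live;
};

/* Fill in the four store sites with their default instruction live. */
static inline void model_init_sites(struct model_alt_site s[MODEL_NR_SITES])
{
	static const char *const def[MODEL_NR_SITES] = {
		"strb", "strh", "str", "str"
	};
	static const char *const repl[MODEL_NR_SITES] = {
		"stlrb", "stlrh", "stlr", "stlr"
	};

	assert(s != NULL);
	for (size_t i = 0; i < MODEL_NR_SITES; i++) {
		s[i].def = def[i];
		s[i].repl = repl[i];
		s[i].live = def[i];
	}
}

/* Switch every site to its replacement only if the CPU has the capability. */
static inline void model_apply_alternatives(struct model_alt_site s[MODEL_NR_SITES],
					    bool cpu_has_cap)
{
	assert(s != NULL);
	if (!cpu_has_cap)
		return;
	for (size_t i = 0; i < MODEL_NR_SITES; i++)
		s[i].live = s[i].repl;
}
```

With cpu_has_cap false, every site keeps its default mnemonic; with it true,
all four switch to the store-release form in one pass and no per-call branch
is involved.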

-Shanker