[PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

From: Zhang, Lei
Date: Tue Jan 29 2019 - 07:30:13 EST


On some variants of the Fujitsu-A64FX cores (versions 1.0 and 1.1),
memory accesses may cause an undefined fault (Data abort, DFSC=0b111111).
This problem will be fixed in the next version of Fujitsu-A64FX.

This fault occurs under a specific hardware condition
when a load/store instruction performs an address translation using:
case-1 TTBR0_EL1 with TCR_EL1.NFD0 == 1.
case-2 TTBR0_EL2 with TCR_EL2.NFD0 == 1.
case-3 TTBR1_EL1 with TCR_EL1.NFD1 == 1.
case-4 TTBR1_EL2 with TCR_EL2.NFD1 == 1.
The fault is completely spurious.
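
The four cases above reduce to one check: the translation uses TTBRn
while the matching NFDn bit is set. A minimal sketch of that check,
assuming the NFD0/NFD1 bit positions of TCR_EL1 (bits 53 and 54, which
TCR_EL2 shares when it uses the EL1-style layout). The helper name is
hypothetical and is not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* NFD0/NFD1 bit positions in TCR_EL1 (assumed identical for the
 * EL1-style TCR_EL2 layout). */
#define TCR_NFD0 (UINT64_C(1) << 53)
#define TCR_NFD1 (UINT64_C(1) << 54)

/*
 * Hypothetical helper: returns true when a translation through the
 * selected TTBR can hit the erratum condition, i.e. when the NFD bit
 * that belongs to that TTBR is set in the given TCR value.
 */
static bool erratum_condition_possible(uint64_t tcr, bool uses_ttbr1)
{
    uint64_t nfd = uses_ttbr1 ? TCR_NFD1 : TCR_NFD0;

    return (tcr & nfd) != 0;
}
```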

Since TCR_ELx.NFD1 is set to '1' by the kernel in versions
after 4.17, case-3 or case-4 may happen.

This fault can be taken only at stage-1,
so this fault is taken from EL0 to EL1/EL2, from EL1 to EL1,
or from EL2 to EL2.

I would like to post a workaround to avoid this problem on
existing Fujitsu-A64FX versions.

There are 2 points in this workaround.
Point1: trap from EL1 to EL1, EL2 to EL2
Set TCR_ELx.NFD1 to '0' at kernel entry,
and set it back to '1' at kernel exit.

From the viewpoint of the ARM specification, there is no problem with
changing TCR_ELx.{NFD0,NFD1} while in EL1/EL2, because
TCR_ELx.{NFD0,NFD1} controls whether to perform a translation
table walk in response to an access from EL0.

I confirmed that:
* There is no load/store instruction between
  tramp_ventry and setting TCR_ELx.NFD1 to '0'.
* There is no load/store instruction between
  setting TCR_ELx.NFD1 to '1' and tramp_exit.
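
The bit manipulation behind Point1 can be sketched in C as below. This
is only a model on a plain 64-bit value: the real change lives in
assembly (entry.S and proc.S), where the register is read and written
with mrs/msr. The function names are hypothetical, and bit 54 is
assumed to be the NFD1 position.

```c
#include <stdint.h>

/* Assumed position of NFD1 in TCR_ELx. */
#define TCR_NFD1 (UINT64_C(1) << 54)

/* Kernel entry: clear NFD1 so that TTBR1 translations made while in
 * the kernel cannot hit the erratum condition. */
static uint64_t tcr_on_kernel_entry(uint64_t tcr)
{
    return tcr & ~TCR_NFD1;
}

/* Kernel exit: set NFD1 again before returning to EL0. */
static uint64_t tcr_on_kernel_exit(uint64_t tcr)
{
    return tcr | TCR_NFD1;
}
```

All other TCR fields pass through unchanged, so an entry followed by an
exit restores the value the kernel had programmed at boot.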

Point2: trap from EL0 to EL1/EL2
Since this fault also occurs in EL0,
replace the fault handler for Data abort
DFSC=0b111111 with a new one that ignores this undefined fault.
I guarantee that, by ignoring this undefined fault, the thread will
stop receiving this fault code.

The hardware condition that causes this fault is reset at exception entry,
therefore execution of at least one instruction is
guaranteed by this single retry.
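
To illustrate what the replacement handler has to recognise, here is a
minimal sketch that decodes an ESR_ELx value and reports whether it is
a Data abort with DFSC=0b111111. It assumes the architectural ESR
layout (EC in bits [31:26], DFSC in bits [5:0], EC 0x24/0x25 for Data
aborts from a lower or the current EL). The function name is
hypothetical and this is not the code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define ESR_EC_SHIFT     26
#define ESR_EC_MASK      UINT32_C(0x3f)
#define ESR_EC_DABT_LOW  UINT32_C(0x24)  /* Data abort from a lower EL */
#define ESR_EC_DABT_CUR  UINT32_C(0x25)  /* Data abort from the current EL */
#define ESR_DFSC_MASK    UINT32_C(0x3f)
#define DFSC_UNDEFINED   UINT32_C(0x3f)  /* DFSC=0b111111 */

/*
 * Hypothetical check: true when the syndrome describes a Data abort
 * whose fault status code is 0b111111. A handler for this case would
 * simply return and let the faulting instruction be retried.
 */
static bool is_undefined_dfsc_abort(uint32_t esr)
{
    uint32_t ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
    bool data_abort = (ec == ESR_EC_DABT_LOW) || (ec == ESR_EC_DABT_CUR);

    return data_abort && (esr & ESR_DFSC_MASK) == DFSC_UNDEFINED;
}
```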

This workaround is based on linux-5.0-rc2,
in which TCR_ELx.NFD1 is set to '1'
only once during the boot sequence,
and TCR_ELx.NFD0 is not set by the kernel.
I will update my patch if a new kernel changes how
TCR_ELx.{NFD0,NFD1} are handled.

Changes since [v1]
Following Mark's review:

* Adopted errata framework.

Changes since [v2]
Following Mark's and James' review:

* Added framework to change TCR_ELx.NFD1.
- Change TCR_ELx.NFD1 to 0 on kernel entry.
- Change TCR_ELx.NFD1 to 1 on kernel exit.

I would greatly appreciate it if someone could test this patch on different chips
to verify that it has no harmful effect on them.

If there is no problem on other chips, please merge this patch.

The patch is based on linux-5.0-rc2.

Zhang Lei (1):
Arm64: Add workaround for Fujitsu A64FX erratum 010001

Documentation/arm64/silicon-errata.txt | 1 +
arch/arm64/Kconfig | 22 ++++++++++++++++++++++
arch/arm64/include/asm/cpucaps.h | 3 ++-
arch/arm64/include/asm/cputype.h | 4 ++++
arch/arm64/kernel/cpu_errata.c | 8 ++++++++
arch/arm64/kernel/entry.S | 16 ++++++++++++++++
arch/arm64/mm/fault.c | 16 +++++++++++++++-
arch/arm64/mm/proc.S | 20 ++++++++++++++++++++
8 files changed, 88 insertions(+), 2 deletions(-)

--
1.8.3.1
