Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

From: James Morse
Date: Wed Jan 30 2019 - 09:56:59 EST


Hi guys,

On 01/29/2019 06:10 PM, Catalin Marinas wrote:
> Could you please copy the whole description from the cover letter to the
> actual patch and only send one email (full description as in here
> together with the patch)? If we commit this to the kernel, it would be
> useful to have the information in the log for reference later on.
>
> More comments below:
>
> On Tue, Jan 29, 2019 at 12:29:58PM +0000, Zhang, Lei wrote:
> > On some variants of the Fujitsu-A64FX cores (versions 1.0 and 1.1),
> > memory accesses may cause an undefined fault (Data abort, DFSC=0b111111).
> > This problem will be fixed in the next version of Fujitsu-A64FX.
> >
> > This fault occurs under a specific hardware condition
> > when a load/store instruction performs an address translation using:
> > case-1 TTBR0_EL1 with TCR_EL1.NFD0 == 1.
> > case-2 TTBR0_EL2 with TCR_EL2.NFD0 == 1.
> > case-3 TTBR1_EL1 with TCR_EL1.NFD1 == 1.
> > case-4 TTBR1_EL2 with TCR_EL2.NFD1 == 1.
> > This fault is completely spurious.
>
> So this looks like new information on the hardware behaviour since the
> v2 of the patch. Can this fault occur for any type of instruction
> accessing the memory or only for SVE instructions?
>
> > Since the kernel sets TCR_ELx.NFD1 to '1' in versions
> > past 4.17, case-3 or case-4 may happen.
> >
> > This fault can be taken only at stage-1,
> > so this fault is taken from EL0 to EL1/EL2, from EL1 to EL1,
> > or from EL2 to EL2.
> >
> > I would like to post a workaround to avoid this problem on
> > the existing Fujitsu-A64FX versions.
>
> How likely is it to trigger this erratum? In other words, aren't we
> better off with a spurious fault that we ignore rather than toggling the
> TCR_ELx.NFD1 bit?

It sounds like the spurious fault can occur as a result of a load/store
('there is no load/store instruction between'...).

If this can happen in kernel_enter, it will overwrite the exception
registers, and we lose the original ELR.

If a load/store can trigger it, I don't think we can ignore it.
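
For reference, the fault signature and trigger conditions quoted above
(Data abort with DFSC=0b111111, translation with an NFD bit set) could be
checked roughly as below. This is only a sketch under assumptions: the
function and macro names are hypothetical and not taken from the patch, the
field positions are the architectural ESR_ELx and TCR_EL1 ones, and I am
assuming TCR_EL2 uses the same NFD0/NFD1 bit positions when it has the EL1
layout (E2H set).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ESR_ELx layout: EC in bits [31:26], DFSC in ISS bits [5:0]. */
#define SK_ESR_EC_SHIFT     26
#define SK_ESR_EC_MASK      0x3fULL
#define SK_ESR_EC_DABT_LOW  0x24ULL /* Data abort from a lower EL */
#define SK_ESR_EC_DABT_CUR  0x25ULL /* Data abort from the current EL */
#define SK_ESR_DFSC_MASK    0x3fULL
#define SK_DFSC_ERRATUM     0x3fULL /* DFSC=0b111111 from the description */

/* TCR_EL1 non-fault translation table walk disable bits. */
#define SK_TCR_NFD0         (1ULL << 53)
#define SK_TCR_NFD1         (1ULL << 54)

/* Hypothetical helper: does this syndrome look like the erratum fault? */
static bool sk_esr_matches_erratum(uint64_t esr)
{
	uint64_t ec = (esr >> SK_ESR_EC_SHIFT) & SK_ESR_EC_MASK;

	if (ec != SK_ESR_EC_DABT_LOW && ec != SK_ESR_EC_DABT_CUR)
		return false;

	return (esr & SK_ESR_DFSC_MASK) == SK_DFSC_ERRATUM;
}

/* Hypothetical helper: is either NFD bit set, i.e. can case-1..4 apply? */
static bool sk_tcr_exposes_erratum(uint64_t tcr)
{
	return (tcr & (SK_TCR_NFD0 | SK_TCR_NFD1)) != 0;
}

/* Small self-check showing the intended use of the two helpers. */
static void sk_selftest(void)
{
	assert(sk_esr_matches_erratum((SK_ESR_EC_DABT_CUR << SK_ESR_EC_SHIFT) | 0x3f));
	assert(!sk_esr_matches_erratum((SK_ESR_EC_DABT_CUR << SK_ESR_EC_SHIFT) | 0x04));
	assert(sk_tcr_exposes_erratum(SK_TCR_NFD1));
	assert(!sk_tcr_exposes_erratum(0));
}
```

Note this only identifies the syndrome; it does nothing about the
kernel_enter problem above, where ELR has already been overwritten by the
time any handler could look at ESR.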

Thanks,

James