RE: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

From: Zhang, Lei
Date: Tue Feb 05 2019 - 07:49:37 EST


Hi Catalin,

> -----Original Message-----
> From: Catalin Marinas [mailto:catalin.marinas@xxxxxxx]
> Sent: Wednesday, January 30, 2019 3:11 AM
> To: Zhang, Lei
> Cc: 'linux-kernel@xxxxxxxxxxxxxxx'; 'Mark Rutland';
> 'linux-arm-kernel@xxxxxxxxxxxxxxxxxxx'; 'will.deacon@xxxxxxx';
> 'james.morse@xxxxxxx'
> Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX
> erratum 010001
>
> Could you please copy the whole description from the cover letter to the
> actual patch and only send one email (full description as in here
> together with the patch)? If we commit this to the kernel, it would be
> useful to have the information in the log for reference later on.

Thank you for your suggestion. I will send a single email that contains
the whole description together with the patch.

> So this looks like new information on the hardware behaviour since the
> v2 of the patch. Can this fault occur for any type of instruction
> accessing the memory or only for SVE instructions?

With this erratum, any load/store instruction, whether Armv8 or SVE,
can take a spurious fault. The only exception is non-fault accesses.

> How likely is it to trigger this erratum? In other words, aren't we
> better off with a spurious fault that we ignore rather than toggling the
> TCR_ELx.NFD1 bit?

Although the erratum occurs extremely rarely, this patch is required
to handle the issue pointed out by James and Mark in:
https://lkml.org/lkml/2019/1/22/533,
https://lkml.org/lkml/2019/1/22/642.

As James and Mark pointed out, if the erratum occurs at EL1/EL2 before
the system registers ELR and SPSR are saved, these registers will
be overwritten and that information will be lost.

So we keep TCR_ELx.NFD1=0 while running at EL1/EL2.
Please see the supplemental explanation at the end of this mail.

> The problem is that this bit may be cached in the TLB (I haven't checked
> the ARM ARM but that's usually the case with the TCR_ELx bits). If
> that's the case, you can't guarantee a change unless you also perform a
> TLBI VMALL. Arguably, if Fujitsu's microarchitecture doesn't cache the
> NFD bits in the TLB, we could apply the workaround but I'd rather have
> the spurious trap if it's not too often.

On the A64FX microarchitecture, a TLBI VMALL is not necessary
to guarantee that a change to TCR_ELx.{NFD0,NFD1} takes effect.

> Could speculative loads also trigger this? Another option would be to
> toggle it during kernel_neon_begin/end (with the caveat of TLBI as
> mentioned above).

No, a speculative load does not trigger this erratum.

Here are supplemental explanations:

Since this erratum occurs only when TCR_ELx.NFD1=1,
we keep TCR_ELx.NFD1=0 during EL1/EL2.
By doing so, the erratum can occur only at EL0, where the
spurious trap can be handled by the fault handler.

To keep TCR_ELx.NFD1=0 at EL1/EL2, there are two critical
sections that must be handled for the implementation to be complete.
One is the transition from EL0 to EL1/EL2 and the other
is the transition from EL1/EL2 to EL0.

For the former case, I set TCR_ELx.NFD1=0 in tramp_map_kernel.
No load/store instruction is executed at EL1/EL2 before
TCR_ELx.NFD1 is set to 0, so the spurious fault cannot occur there.

For the latter case, I set TCR_ELx.NFD1=1 in tramp_unmap_kernel.
No load/store instruction is executed at EL1/EL2 after
TCR_ELx.NFD1 is set to 1, so the spurious fault cannot occur there.
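
The two transitions can be sketched as a small user-space C model.
This is only an illustration and not the code in the patch: the
variable model_tcr_el1 and all model_* functions are hypothetical
names, and the real change would be made with mrs/msr in the
trampoline assembly. The bit position 54 for NFD1 is taken from the
architectural layout of TCR_EL1 and should be checked against the
Arm ARM.

```c
#include <assert.h>
#include <stdint.h>

/* TCR_EL1.NFD1 bit (assumed here to be bit 54, per the Arm ARM). */
#define TCR_NFD1 (UINT64_C(1) << 54)

/* Hypothetical stand-in for the TCR_EL1 system register. */
static uint64_t model_tcr_el1;

/*
 * Model of the EL0 -> EL1/EL2 transition: NFD1 is cleared before
 * any load/store instruction is executed in the kernel.
 */
static void model_enter_kernel(void)
{
	model_tcr_el1 &= ~TCR_NFD1;
}

/*
 * Model of the EL1/EL2 -> EL0 transition: NFD1 is set again after
 * the last load/store instruction executed in the kernel.
 */
static void model_exit_kernel(void)
{
	model_tcr_el1 |= TCR_NFD1;
}

/* The erratum can only be triggered while NFD1 is 1. */
static int model_erratum_possible(void)
{
	return (model_tcr_el1 & TCR_NFD1) != 0;
}
```

In this model, model_erratum_possible() returns 0 between
model_enter_kernel() and model_exit_kernel(), and the other
TCR bits are left unchanged.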

To handle the spurious fault at EL0,
I replace the fault handler for data aborts with DFSC=0b111111 with
a new fault handler that ignores the spurious fault caused by the erratum.
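
The check made by such a handler can be sketched in plain C as below.
This is a rough model under stated assumptions, not the handler in
the patch: the model_* function names and the return convention
(0 = resume, -1 = report as usual) are hypothetical. The field
positions used are the architectural ones for ESR_ELx (EC in bits
[31:26], DFSC in bits [5:0], EC 0x24 for a data abort from a lower
exception level).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ESR_EC_SHIFT     26
#define ESR_EC_MASK      UINT32_C(0x3f)
#define ESR_EC_DABT_LOW  UINT32_C(0x24)  /* data abort from a lower EL */
#define ESR_DFSC_MASK    UINT32_C(0x3f)
#define DFSC_ERRATUM     UINT32_C(0x3f)  /* DFSC=0b111111 */

/* True if the syndrome is an EL0 data abort with DFSC=0b111111. */
static bool model_is_spurious_el0_fault(uint32_t esr)
{
	uint32_t ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
	uint32_t dfsc = esr & ESR_DFSC_MASK;

	return ec == ESR_EC_DABT_LOW && dfsc == DFSC_ERRATUM;
}

/*
 * Hypothetical handler: returns 0 to ignore the fault and resume
 * the EL0 instruction, or -1 to fall back to normal fault reporting.
 */
static int model_handle_data_abort(uint32_t esr)
{
	if (model_is_spurious_el0_fault(esr))
		return 0;

	return -1;
}
```

Only faults taken from EL0 are ignored in this sketch. The same
DFSC value taken at EL1/EL2 is still reported, which matches the
intent that the erratum cannot occur there while NFD1=0.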

Thanks,
Zhang Lei