Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

From: Mark Rutland
Date: Tue May 16 2023 - 07:52:38 EST


On Tue, May 16, 2023 at 03:47:16PM +0800, Gang Li wrote:
> Hi,
>
> On 2023/5/9 22:30, Mark Rutland wrote:
> > For example, early in D8.13 we have the rule:
> >
> > | R_SQBCS
> > |
> > | When address translation is enabled, a translation table entry for an
> > | in-context translation regime that does not cause a Translation fault, an
> > | Address size fault, or an Access flag fault is permitted to be cached in a
> > | TLB or intermediate TLB caching structure as the result of an explicit or
> > | speculative access.
> >
>
> Thanks a lot!
>
> I looked up the x86 manual and found that the x86 TLB cache mechanism is
> similar to arm64 (but the x86 guys haven't replied to me yet):
>
> Intel® 64 and IA-32 Architectures Software Developer Manuals:
> > 4.10.2.3 Details of TLB Use
> > Subject to the limitations given in the previous paragraph, the
> > processor may cache a translation for any linear address, even if that
> > address is not used to access memory. For example, the processor may
> > cache translations required for prefetches and for accesses that result
> > from speculative execution that would never actually occur in the
> > executed code path.
>
> Both architectures have similar TLB cache policies, so why does arm64
> flush all while x86 flushes local in ghes_map and ghes_unmap?
>
> I think flush all may be unnecessary.
>
> 1. Before accessing ghes data, each CPU needs to call ghes_map, which
> will create the mapping and flush its own TLB to make sure the current
> CPU is using the latest mapping.
>
> 2. And there is no need to flush all in ghes_unmap, because the ghes_map
> of other CPUs will flush their own TLBs before accessing the memory.

This is not sufficient. Regardless of whether CPUs *explicitly* access the VA
range, any CPU which can reach the live translation table entry is allowed to
fetch that and allocate it into a TLB at any time.

When a Break-Before-Make sequence isn't followed, the architecture permits a
number of resulting behaviours, including "amalgamation", where the TLB entries
are combined in some arbitrary IMPLEMENTATION DEFINED way. The architecture
isn't very clear here, but doesn't rule out two entries being combined such
that it generates an arbitrary physical address and/or such that the MMU thinks
the entry is from an intermediate walk. In either of those cases, the CPU might
speculatively access device memory (which could change the state of the system,
or cause fatal SErrors), and/or allocate further junk into TLBs.

So per the architecture, broadcast maintenance is necessary on arm64. The only
way to avoid it would be to have a local set of translation tables which are
not shared with other CPUs.
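
As a rough illustration of why the local flush in ghes_map doesn't help, here
is a toy model in C. It is not kernel code and not how any real MMU is
implemented: every name below (shared_pte, speculative_fill, flush_local,
flush_all, tlb_is_stale) is made up for this sketch, and it assumes a single
VA, two CPUs, and one TLB entry per CPU. It only models a stale entry being
left behind; it does not attempt to model amalgamation, which can be worse.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 2

/* Hypothetical toy model: one shared translation table entry for one VA. */
struct pte {
	bool valid;
	uint64_t pa;
};

/* Each CPU caches at most one translation for that VA. */
struct tlb_entry {
	bool valid;
	uint64_t pa;
};

static struct pte shared_pte;
static struct tlb_entry tlb[NR_CPUS];

/* Install or remove the mapping in the shared tables. */
static void map_pa(uint64_t pa)
{
	shared_pte.valid = true;
	shared_pte.pa = pa;
}

static void unmap(void)
{
	shared_pte.valid = false;
}

/*
 * Any CPU that can reach a live entry may walk it and cache it at any
 * time, whether or not it ever explicitly accesses the VA.
 */
static void speculative_fill(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (shared_pte.valid) {
		tlb[cpu].valid = true;
		tlb[cpu].pa = shared_pte.pa;
	}
}

/* Local invalidate: only affects the calling CPU's TLB. */
static void flush_local(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	tlb[cpu].valid = false;
}

/* Broadcast invalidate: affects every CPU sharing the tables. */
static void flush_all(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		tlb[cpu].valid = false;
}

/* True if 'cpu' holds a cached entry that disagrees with the tables. */
static bool tlb_is_stale(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return tlb[cpu].valid &&
	       (!shared_pte.valid || tlb[cpu].pa != shared_pte.pa);
}
```

In this model, if CPU 0 calls map_pa(), CPU 1 does a speculative_fill(), and
CPU 0 then calls unmap() followed by flush_local(0), tlb_is_stale(1) is true,
and it stays true when CPU 0 later maps a different PA at the same VA. That is
the old and new entries coexisting, i.e. the situation Break-Before-Make
exists to prevent. Replacing flush_local(0) with flush_all() before the new
mapping is installed leaves no stale entry on either CPU.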

I suspect x86 might not have the same issue with amalgamation.

Thanks,
Mark.