Re: [RFC PATCH] membarrier: riscv: Provide core serializing command

From: Mathieu Desnoyers
Date: Fri Aug 04 2023 - 16:05:31 EST


On 8/4/23 15:16, Andrea Parri wrote:
On Fri, Aug 04, 2023 at 02:05:55PM -0400, Mathieu Desnoyers wrote:
On 8/4/23 10:59, Andrea Parri wrote:
What is the relationship between FENCE.I and instruction cache flush on
RISC-V?

The exact nature of this relationship is implementation-dependent. From
commentary included in the ISA portion referred to in the changelog:

A simple implementation can flush the local instruction cache and
the instruction pipeline when the FENCE.I is executed. A more
complex implementation might snoop the instruction (data) cache on
every data (instruction) cache miss, or use an inclusive unified
private L2 cache to invalidate lines from the primary instruction
cache when they are being written by a local store instruction. If
instruction and data caches are kept coherent in this way, or if
the memory system consists of only uncached RAMs, then just the
fetch pipeline needs to be flushed at a FENCE.I. [..]

Mmh, does this help?

Quoting

https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf

Chapter 3 "“Zifencei” Instruction-Fetch Fence, Version 2.0"

"First, it has been recognized that on some systems, FENCE.I will be expensive to implement
and alternate mechanisms are being discussed in the memory model task group. In particular,
for designs that have an incoherent instruction cache and an incoherent data cache, or where
the instruction cache refill does not snoop a coherent data cache, both caches must be completely
flushed when a FENCE.I instruction is encountered. This problem is exacerbated when there are
multiple levels of I and D cache in front of a unified cache or outer memory system.

Second, the instruction is not powerful enough to make available at user level in a Unix-like
operating system environment. The FENCE.I only synchronizes the local hart, and the OS can
reschedule the user hart to a different physical hart after the FENCE.I. This would require the
OS to execute an additional FENCE.I as part of every context migration. For this reason, the
standard Linux ABI has removed FENCE.I from user-level and now requires a system call to
maintain instruction-fetch coherence, which allows the OS to minimize the number of FENCE.I
executions required on current systems and provides forward-compatibility with future improved
instruction-fetch coherence mechanisms.

Future approaches to instruction-fetch coherence under discussion include providing more
restricted versions of FENCE.I that only target a given address specified in rs1, and/or allowing
software to use an ABI that relies on machine-mode cache-maintenance operations."

I start to suspect that even the people working on the RISC-V memory model have noticed
that letting a single instruction such as FENCE.I both maintain cache coherency
*and* flush the instruction pipeline will be a performance bottleneck, because it
can only clear the whole instruction cache.

Other architectures are either cache-coherent, or have cache flushing which can be
performed on a range of addresses. This is kept apart from whatever instruction
flushes the instruction pipeline of the processor.

By keeping instruction cache flushing separate from instruction pipeline flush, we can
let membarrier (and context switches, including thread migration) only care about the
instruction pipeline part, and leave instruction cache flush to either a dedicated
system call, or to specialized instructions which are available from user-mode.

Considering that FENCE.I is forced to invalidate the whole i-cache, I don't think you
will get away with executing it from switch_mm without making performance go down the
drain on cache-incoherent implementations.

In my opinion, what we would need from RISC-V for membarrier (and context switch) is a
lightweight version of FENCE.I which only flushes the instruction pipeline of the local
processor. This should ideally come with a way for architectures with incoherent caches
to flush the relevant address ranges of the i-cache which are modified by a JIT. This
i-cache flush would not be required to flush the instruction pipeline, as it is typical
to batch invalidation of various address ranges together and issue a single instruction
pipeline flush on each CPU at the end. The i-cache flush could either be done by new
instructions available from user-space (similar to aarch64), or through privileged
instructions available through system calls (similar to ARM cacheflush(2)).
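
To make the batching idea concrete, here is a rough userspace sketch of the
flow a JIT could follow: per-range i-cache maintenance first, then a single
core-serializing membarrier for the whole process. It is an illustration
under assumptions, not a proposal for the actual interface. The names
struct code_range, sys_membarrier(), jit_register_sync_core() and
jit_publish_ranges() are made up for the example. __builtin___clear_cache()
(GCC/Clang) stands in for the per-range flush; whether it avoids a full
FENCE.I on a given RISC-V system depends on the toolchain and kernel, and
the SYNC_CORE commands are only available on architectures that implement
them.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical descriptor for one freshly written code range. */
struct code_range {
	char *start;
	size_t len;
};

/* Thin wrapper: membarrier is invoked through syscall(2) here. */
static int sys_membarrier(int cmd, unsigned int flags)
{
	return (int)syscall(__NR_membarrier, cmd, flags, 0);
}

/* Call once at startup, before using the expedited sync-core command. */
static int jit_register_sync_core(void)
{
	return sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}

/* Returns 0 on success, -1 with errno set on failure. */
static int jit_publish_ranges(const struct code_range *r, size_t n)
{
	if (r == NULL && n > 0) {
		errno = EINVAL;
		return -1;
	}

	/*
	 * Step 1: i-cache maintenance, one call per modified range.
	 * Assumption: this stand-in represents a range-based flush that
	 * does not need to flush the instruction pipeline by itself.
	 */
	for (size_t i = 0; i < n; i++)
		__builtin___clear_cache(r[i].start, r[i].start + r[i].len);

	/*
	 * Step 2: a single core-serializing barrier covering every CPU
	 * currently running a thread of this process.
	 */
	return sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}
```

The point of the split is that step 1 scales with the amount of modified
code, while step 2 is paid once per batch no matter how many ranges were
touched.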

Thanks for the remarks, Mathieu. I think it will be very helpful to
RISC-V architects (and memory model people) to have this context and
reasoning written down.

One more noteworthy detail: if a system call similar to ARM cacheflush(2) is implemented for
RISC-V, perhaps an iovec ABI (similar to readv(2)/writev(2)) would be relevant to handle
batching of cache flushing when address ranges are not contiguous. Maybe with a new name
like "cacheflushv(2)", so eventually other architectures could implement it as well?
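
As a very rough sketch of the shape such an ABI could take, here is a
userspace stand-in that walks a struct iovec array the same way
readv(2)/writev(2) do. No cacheflushv system call exists; the function name
cacheflushv_sketch(), its arguments, the "flags must be 0" rule and the
return convention are all assumptions made for illustration, and
__builtin___clear_cache() (GCC/Clang) merely stands in for whatever
per-range operation the kernel would perform.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Hypothetical stand-in for a vectored cache flush.
 * Returns 0 on success, -1 with errno set to EINVAL on bad arguments.
 */
static int cacheflushv_sketch(const struct iovec *iov, int iovcnt, int flags)
{
	/* Assumption: no flags are defined yet, so reject anything but 0. */
	if (iovcnt < 0 || (iovcnt > 0 && iov == NULL) || flags != 0) {
		errno = EINVAL;
		return -1;
	}

	for (int i = 0; i < iovcnt; i++) {
		char *start = iov[i].iov_base;

		/* Empty entries are skipped, as readv/writev tolerate them. */
		if (iov[i].iov_len == 0)
			continue;

		/* One range-based flush per non-contiguous region. */
		__builtin___clear_cache(start, start + iov[i].iov_len);
	}
	return 0;
}
```

A caller would fill one iovec entry per patched region and issue a single
call, rather than one system call per region.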

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com