Re: [RFC PATCH] membarrier: riscv: Provide core serializing command
From: Palmer Dabbelt
Date: Fri Oct 13 2023 - 13:29:41 EST
On Mon, 07 Aug 2023 06:19:18 PDT (-0700), parri.andrea@xxxxxxxxx wrote:
One more noteworthy detail: if a system call similar to ARM cacheflush(2) is implemented for
RISC-V, perhaps an iovec ABI (similar to readv(2)/writev(2)) would be relevant to handle
batching of cache flushing when address ranges are not contiguous. Maybe with a new name
like "cacheflushv(2)", so eventually other architectures could implement it as well ?
I believe that's a sensible idea. But the RISC-V maintainers can provide
a more reliable feedback.
Sorry I missed this, I'm still a bit backlogged from COVID. A few of us
were having a meeting, just to try and summarize (many of these points
came up in the thread, so sorry for rehashing things):
We don't have a fence.i in the scheduling path, as fence.i is very slow
on systems that implement it by flushing the icache. Instead we have a
mechanism for deferring the fences (see flush_icache_deferred, though
I'm no longer sure that's correct which I'll mention below). As a
result userspace can't do a fence.i directly, but instead needs to make
a syscall/vdsocall so the kernel can do this bookkeeping. There's some
proposals for ISA extensions that replace fence.i, but they're still WIP
and there's a lot of fence.i-only hardware so we'll have to deal with
it.
When we did this we had a feeling this may be sub-optimal for systems
that have faster fence.i implementations (ie, coherent instruction
caches), but nobody's gotten around to doing that yet -- and maybe
there's no hardware that behaves this way. The rough plan was along the
lines of adding a prctl() where userspace can request the ability to
directly emit fence.i, which would then result in the kernel eagerly
emitting fence.i when scheduling. Some of the Java people have been
asking for this sort of feature.
From looking at the membarrier arch/scheduler hooks, I think we might
have a bug in our deferred icache flushing mechanism: specifically we
hook into switch_mm(), which this comment has me worried about
* When switching through a kernel thread, the loop in
* membarrier_{private,global}_expedited() may have observed that
* kernel thread and not issued an IPI. It is therefore possible to
* schedule between user->kernel->user threads without passing though
* switch_mm(). Membarrier requires a barrier after storing to
* rq->curr, before returning to userspace, so provide them here:
Even if there's not a bug in the RISC-V stuff, it seems that we've ended
up with pretty similar schemes here and we could remove some
arch-specific code by de-duplicating things -- IIRC there was no
membarrier when we did the original port, so I think we've just missed a
cleanup opportunity.
So I'd propose doing the following:
* Pick up a patch like this. Mmaybe exactly this, I'm going to give it
a proper review to make sure.
* Remove the RISC-V implemenation of deferred icache flushes and instead
just call into membarrier. We might need to add some more bookkeeping
here, but from a quick look it seems membarrier is doing pretty much
the same thing.
* Implement that prctl that allows userspace to ask for permission to do
direct fence.i instructions -- sort of a different project, but if
we're going to be tearing into all this code we might as well do it
now.
Charlie is volunteering to do the work here, so hopefully we'll have
something moving forward.
Andrea