Re: [RFC PATCH] membarrier: riscv: Provide core serializing command

From: Palmer Dabbelt
Date: Fri Oct 13 2023 - 13:29:41 EST


On Mon, 07 Aug 2023 06:19:18 PDT (-0700), parri.andrea@xxxxxxxxx wrote:
One more noteworthy detail: if a system call similar to ARM cacheflush(2) is implemented for
RISC-V, perhaps an iovec ABI (similar to readv(2)/writev(2)) would be relevant to handle
batching of cache flushing when address ranges are not contiguous. Maybe with a new name
like "cacheflushv(2)", so eventually other architectures could implement it as well ?

I believe that's a sensible idea. But the RISC-V maintainers can provide
a more reliable feedback.

Sorry I missed this, I'm still a bit backlogged from COVID. A few of us were having a meeting, just to try and summarize (many of these points came up in the thread, so sorry for rehashing things):

We don't have a fence.i in the scheduling path, as fence.i is very slow on systems that implement it by flushing the icache. Instead we have a mechanism for deferring the fences (see flush_icache_deferred, though I'm no longer sure that's correct which I'll mention below). As a result userspace can't do a fence.i directly, but instead needs to make a syscall/vdsocall so the kernel can do this bookkeeping. There's some proposals for ISA extensions that replace fence.i, but they're still WIP and there's a lot of fence.i-only hardware so we'll have to deal with it.

When we did this we had a feeling this may be sub-optimal for systems that have faster fence.i implementations (ie, coherent instruction caches), but nobody's gotten around to doing that yet -- and maybe there's no hardware that behaves this way. The rough plan was along the lines of adding a prctl() where userspace can request the ability to directly emit fence.i, which would then result in the kernel eagerly emitting fence.i when scheduling. Some of the Java people have been asking for this sort of feature.

From looking at the membarrier arch/scheduler hooks, I think we might
have a bug in our deferred icache flushing mechanism: specifically we hook into switch_mm(), which this comment has me worried about

* When switching through a kernel thread, the loop in
* membarrier_{private,global}_expedited() may have observed that
* kernel thread and not issued an IPI. It is therefore possible to
* schedule between user->kernel->user threads without passing though
* switch_mm(). Membarrier requires a barrier after storing to
* rq->curr, before returning to userspace, so provide them here:

Even if there's not a bug in the RISC-V stuff, it seems that we've ended up with pretty similar schemes here and we could remove some arch-specific code by de-duplicating things -- IIRC there was no membarrier when we did the original port, so I think we've just missed a cleanup opportunity.

So I'd propose doing the following:

* Pick up a patch like this. Mmaybe exactly this, I'm going to give it a proper review to make sure.
* Remove the RISC-V implemenation of deferred icache flushes and instead just call into membarrier. We might need to add some more bookkeeping here, but from a quick look it seems membarrier is doing pretty much the same thing.
* Implement that prctl that allows userspace to ask for permission to do direct fence.i instructions -- sort of a different project, but if we're going to be tearing into all this code we might as well do it now.

Charlie is volunteering to do the work here, so hopefully we'll have something moving forward.


Andrea