Re: Allow data races on some read/write operations

From: Ralf Jung
Date: Tue Mar 18 2025 - 10:57:56 EST


Hi all,

> I may even later copy the data at place B to place C where C might have
> concurrent reads and/or writes, and the kernel will not experience UB
> because of this. The data may be garbage, but that is fine. I am not
> interpreting the data, or making control flow decisions based on it. I
> am just moving the data.

> My understanding is: In Rust, this program would be illegal and might
> experience UB in unpredictable ways, not limited to just the data that
> is being moved.

That is correct. C and Rust behave the same here.

> Is there a difference between formal models of the languages and
> practical implementations of the languages here? I'm asking this because
> C kernel developers seem to be writing these programs that are illegal
> under the formal spec of the C language, but work well in practice.
> Could it be the same in Rust?

> That is, can I do this copy and get away with it in practice under the
> circumstances outlined earlier?

As with off-label drug usage, things can of course go well even if you deliberately step outside the range of well-defined usage specified by the manufacturer.
However, answering your question conclusively requires intimate knowledge of the entire compilation chain. I'm not even sure there's a single person who has everything from front-end transformations to back-end lowering in their head...
At the scale that compilers have reached, I think we have to compartmentalize by establishing abstractions (such as the Rust / C language specs, and the LLVM IR language spec). This enables each part of the compiler to locally ensure their consistency with the spec (hopefully that one part still fits in one person's head), and as long as everyone uses the same spec and interprets it the same way, we achieve a consistent end-to-end result from many individually consistent pieces.

Personally, my goal has always been to identify the cases where programmers deliberately reach for such off-label usage, figure out the missing parts in the language that motivate them to do this, and add them, so that we can move on with everything on solid footing. :) I did not realize that atomic memcpy is so crucial for the kernel, but it makes sense in hindsight. So IMO that is where we should spend our effort, rather than digging through the entire compilation pipeline to determine some works-in-practice off-label alternative.

> One option I have explored is just calling C memcpy directly, but
> because of LTO, that is no different than doing the operation in Rust.
>
> I don't think I need atomic memcpy, I just need my program not to
> explode if I move some data to or from a place that is experiencing
> concurrent writes without synchronization. Not in general, but for some
> special cases where I promise not to look at the data outside of moving
> it.

I'm afraid I do not know of a language, other than assembly, that can provide this.

Atomic memcpy, however, should be able to cover your use-case, so it seems like
a reasonable solution to me? Marking things as atomic is literally how you tell
the compiler "don't blow up if there are concurrent accesses".
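
To make this concrete, here is a minimal sketch of what "marking things as atomic" buys you today in stable Rust. This is not the RFC's proposed atomic memcpy API; it is a hand-rolled byte-wise copy using relaxed atomic loads, so concurrent writers can only produce garbage or torn data in the destination, never UB. The function name `relaxed_copy_from` is my own, purely illustrative:

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Copy out of a buffer that may be experiencing concurrent writes.
/// Each byte is read with a relaxed atomic load: there is no data-race
/// UB, but the result may mix bytes from different writes (tearing).
fn relaxed_copy_from(src: &[AtomicU8], dst: &mut [u8]) {
    assert_eq!(src.len(), dst.len());
    for (s, d) in src.iter().zip(dst.iter_mut()) {
        // Relaxed: no ordering guarantees, just freedom from race UB.
        *d = s.load(Ordering::Relaxed);
    }
}

fn main() {
    let src: Vec<AtomicU8> = (0u8..8).map(AtomicU8::new).collect();
    let mut dst = [0u8; 8];
    relaxed_copy_from(&src, &mut dst);
    assert_eq!(dst, [0, 1, 2, 3, 4, 5, 6, 7]);
}
```

A real atomic memcpy would presumably do wider-than-byte accesses where possible; the point of the sketch is only that atomics are the mechanism by which the compiler is told concurrent accesses exist.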

> If atomic memcpy is what we really need to write these kinds of programs in
> Rust, what would be the next steps to get this in the language?

There is an RFC, but it has been stalled for a while: <https://github.com/rust-lang/rfcs/pull/3301>. I do not know its exact status. It might be blocked on having this in the C++ model, though at least unstable experimentation should be possible before C++ has fully standardized the way this will look. (We'll want to ensure consistency of the C++ and Rust models here to ensure that C, C++, and Rust can interop on shared memory in a coherent way.)
On the C++ side (where the atomic memcpy would likely be added to the concurrency memory model, to be then adopted by C and Rust), I heard there was a lot of non-technical trouble due to ISO changing their procedural rules for how they wanted changes to the standard to look like. I don't know any further details here as I am not directly involved.

> Also, would there be a performance price to pay for this?

I know little about evaluating performance at the low-level architectural or even microarchitectural level. However, I would think that in the end the memcpy itself (when using the "relaxed" atomic ordering) would be the same existing operation, the same assembly; it is just treated differently by optimizations before reaching the assembly stage.

Kind regards,
Ralf



> Best regards,
> Andreas Hindborg