Re: C aggregate passing (Rust kernel policy)
From: Ralf Jung
Date: Tue Mar 04 2025 - 13:47:18 EST
Hi all,
Whether the compiler is permitted to do that depends heavily on what exactly
the code looks like, so it's hard to discuss this in the abstract.
If inside some function, *all* writes to a given location are atomic (I
think that's what you call WRITE_ONCE?), then the compiler is *not* allowed
to invent any new writes to that memory. The compiler has to assume that
there might be concurrent reads from other threads, whose behavior could
change from the extra compiler-introduced writes. The spec (in C, C++, and
Rust) already works like that.
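
To make that first case concrete, here is a minimal Rust sketch (the
function and variable names are just illustrative, not from any real code):

    use std::sync::atomic::{AtomicU32, Ordering};

    // Every write to `flag` in this function is atomic, so the compiler
    // must not invent additional stores to it: another thread might be
    // reading `flag` concurrently, and a spurious store could change the
    // values that thread is able to observe.
    fn publish(flag: &AtomicU32) {
        // Roughly the analogue of the kernel's WRITE_ONCE().
        flag.store(1, Ordering::Relaxed);
    }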
OTOH, the moment you do a single non-atomic write (i.e., a regular "*ptr =
val;" or memcpy or so), that is a signal to the compiler that there cannot
be any concurrent accesses happening at the moment, and therefore it can
(and likely will) introduce extra writes to that memory.
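
Again as a hedged sketch (the raw pointer and the names are illustrative):

    // A plain, non-atomic write. The compiler may assume no other thread
    // is accessing `*slot` at this moment, so it is free to introduce
    // extra writes here, e.g. splitting the store into several smaller
    // ones or temporarily using the location as scratch space.
    unsafe fn init(slot: *mut u32, val: u32) {
        *slot = val;
    }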
> Is that how it really works?
> I'd expect the atomic writes to have what we call "compiler barriers"
> before and after; IOW, the compiler can do whatever it wants with
> non-atomic writes, provided it doesn't cross those barriers.
If you do a non-atomic write, and then an atomic release write, that release
write marks communication with another thread. When I said "concurrent accesses
[...] at the moment" above, the details of what exactly that means matter a lot:
by doing an atomic release write, the "moment" has passed, as now other threads
could be observing what happened.
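
In code, this is the classic message-passing pattern; a Rust sketch
(the `Shared` type is mine, just for illustration):

    use std::cell::UnsafeCell;
    use std::sync::atomic::{AtomicBool, Ordering};

    struct Shared {
        data: UnsafeCell<u32>,
        ready: AtomicBool,
    }
    // SAFETY: all cross-thread access to `data` is ordered via `ready`.
    unsafe impl Sync for Shared {}

    // Writer: a plain write, then a release store. The release store is
    // the point at which other threads may start observing `data`.
    fn writer(s: &Shared) {
        unsafe { *s.data.get() = 42 };           // non-atomic write
        s.ready.store(true, Ordering::Release);  // "the moment has passed"
    }

    // Reader: if the acquire load sees `true`, the preceding non-atomic
    // write to `data` is guaranteed to be visible; there is no data race.
    fn reader(s: &Shared) -> Option<u32> {
        if s.ready.load(Ordering::Acquire) {
            Some(unsafe { *s.data.get() })
        } else {
            None
        }
    }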
One can get quite far thinking about these things in terms of "barriers" that
block the compiler from reordering operations, but that is not actually what
happens. The underlying model is based on describing the set of behaviors that a
program can have when using particular atomic memory orderings (such as
release, acquire, relaxed); the compiler is responsible for ensuring that the
resulting program only exhibits those behaviors. A "barrier"-based approach is
one way to achieve that, but not the only one: at least in special cases,
compilers can and do perform more optimizations. The only thing that matters
is that the resulting program still behaves as if it were executed according
to the rules of the language, i.e., the program execution must be captured by
the set of behaviors that the memory model for atomics permits. This set of behaviors is,
btw, completely portable; this is truly an abstract semantics and not tied to
what any particular hardware does.
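
One example of an optimization that a pure "barriers" view would not
predict (again just a sketch):

    use std::sync::atomic::{AtomicU32, Ordering};

    // The compiler may legally coalesce these two relaxed stores into a
    // single store of 2: every behavior of the transformed program is
    // also a permitted behavior of the original under the memory model,
    // even though no "barrier" was ever crossed.
    fn double_store(x: &AtomicU32) {
        x.store(1, Ordering::Relaxed);
        x.store(2, Ordering::Relaxed);
    }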
Now, that's the case for general C++ or Rust. The Linux kernel is special in
that its concurrency support predates the official model, so it is written in a
different style, commonly referred to as the Linux kernel memory model (LKMM).
I'm not aware of a formal study
of that model to the same level of rigor as the C++ model, so for me as a
theoretician it is much harder to properly understand what happens there,
unfortunately. My understanding is that many LKMM operations can be mapped to
equivalent C++ operations (e.g., WRITE_ONCE and READ_ONCE correspond to
relaxed atomic stores and loads, respectively). However, the LKMM also makes use of dependencies
(address and/or data dependencies? I am not sure), and unfortunately those
fundamentally clash with even basic compiler optimizations such as GVN/CSE or
algebraic simplifications, so it's not at all clear how they can even be used in
an optimizing compiler in a formally sound way (i.e., "we could, in principle,
mathematically prove that this is correct"). Finding a rigorous way to equip an
optimized language such as C, C++, or Rust with concurrency primitives that emit
the same efficient assembly code as what the LKMM can produce is, I think, an
open problem. Meanwhile, the LKMM seems to work in practice despite those
concerns, and that should apply to both C (when compiled with clang) and Rust in
the same way -- but when things go wrong, the lack of a rigorous contract will
make it harder to determine whether the bug is in the compiler or the kernel.
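
To illustrate the clash, here is a Rust sketch (relaxed loads playing the
role of READ_ONCE; the names are mine) of how an ordinary value optimization
can erase an intended address dependency:

    use std::sync::atomic::{AtomicUsize, Ordering};

    fn dependent_load(idx: &AtomicUsize, arr: &[AtomicUsize]) -> usize {
        let i = idx.load(Ordering::Relaxed);
        if i == 0 {
            // Inside this branch the compiler knows `i == 0`, so GVN may
            // rewrite `arr[i]` as `arr[0]`. The emitted load then no
            // longer depends on the load of `idx`, and any hardware
            // ordering that the address dependency was supposed to
            // provide is silently lost.
            arr[i].load(Ordering::Relaxed)
        } else {
            0
        }
    }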
But again, Rust should behave exactly like clang here, so this should not be a
new concern. :)
Kind regards,
Ralf