I am not aware of a comparable tool that is in widespread use, or that is
carefully aligned with the semantics of an actual compiler.
For C, there is Cerberus (https://www.cl.cam.ac.uk/~pes20/cerberus/) as an
executable version of the C specification, but it can only run tiny examples.
The verified CompCert compiler comes with a semantics one could interpret, but
that only checks code for compatibility with CompCert C, which has a lot less
(and a bit more) UB than real C.
There are also two efforts that turned into commercial tools. I have not tried
them, and there is hardly any documentation of how they interpret the C
standard, so it's not clear what a green light from them means when compiling
with gcc or clang. I also don't know how much real-world code they can actually run.
- TrustInSoft/tis-interpreter, mostly gone from the web but still available in
the wayback machine
(https://web.archive.org/web/20200804061411/https://github.com/TrustInSoft/tis-interpreter/);
I assume this got integrated into their "TrustInSoft Analyzer" product.
- kcc, a K-framework-based formalization of C that is executable. The public
repo is dead (https://github.com/kframework/c-semantics), and when I tried to
build their tool, it didn't work. The people behind this have a company that
offers "RV-Match" as a commercial product claiming to find bugs in C based on "a
complete formal ISO C11 semantics", so I guess that is where their efforts go now.
For C++ and Zig, I am not aware of anything comparable.
Part of the problem is that in C, 2 people will have 3 ideas for what the
standard means. Compiler writers and programmers regularly have wildly
conflicting ideas of what is and is not allowed. There are many different places
in the standard that have to be scanned to answer "is this well-defined" even
for very simple programs. (https://godbolt.org/z/rjaWc6EzG is one of my favorite
examples.) A tool can check a single well-defined semantics, but who gets to
decide what exactly those semantics are?
Formalizing the C standard requires extensive interpretation, so I am skeptical
of everyone who claims that they "formalized the C standard" and built a tool on
that without extensive evaluation of how their formalization compares to what
compilers do and what programmers rely on. The Cerberus people have done that
evaluation (see e.g. https://dl.acm.org/doi/10.1145/2980983.2908081), but none
of the other efforts have (to my knowledge). Ideally such a formalization effort
would be done in close collaboration with compiler authors and the committee so
that the ambiguities in the standard can be resolved and the formalization
becomes the one canonical interpretation. The Cerberus people are the ones who
pushed the C provenance formalization through, so they have made great progress here.
However, many issues remain, some quite long-standing (e.g.
https://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_260.htm and
https://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_451.htm, which in my eyes
never got properly resolved by clarifying the standard). Martin and a few others
are slowly pushing things in the right direction, but it takes a long time.
Rust, by having a single project in charge of the one canonical implementation
and the specification, and having an open process that is well-suited for
incorporating user concerns, can move a lot quicker here. C has a huge
head start and Rust has nothing like the C standard yet, but we are catching up --
and our goal is more ambitious than that: we did our best to learn from C and
C++ and concluded that that style of specification is too prone to ambiguity, so
we are trying to achieve a formally precise, unambiguous specification. Wasm
shows that this can be done, at industry scale, albeit for a small language --
time we do it for a large one. :)
So, yes, I think Miri is fairly unique. But please let me know if I missed something!
(As an aside, the above hopefully also explains why some people in Rust are
concerned about alternative implementations. We do *not* want the current
de-facto behavior to ossify and become the specification. We do *not* want the
specification to just be a description of what the existing implementations at
the time happen to do, and declare all behavior differences to be UB or
unspecified or so just because no implementation is willing to adjust their
behavior to match the rest. We want the specification to be prescriptive, not
descriptive, and we will adjust the implementation as we see fit to achieve that
-- within the scope of Rust's stability guarantees. That's also why we are so
cagey about spelling out the aliasing rules until we are sure we have a good
enough model.)
Very interesting, thank you for the exhaustive answer.
Might it be accurate to categorize Miri as a
"formal-semantics-based undefined-behavior-detecting interpreter"?
https://godbolt.org/z/rjaWc6EzG
That example uses a compiler-specific attribute AFAIK, namely
__attribute__((noinline))
When using compiler-specific attributes and options, the
original language is arguably no longer being used, depending
on the attribute. Though a language being inexpressive enough
to require compiler extensions to achieve some goals,
possibly as in this C example, can be a disadvantage in itself.
[On formalization]
I agree that Rust has some advantages with regard to formalization,
but some of the ones I think of are different from what you
mention here. And I also see some disadvantages.
C is an ancient language, and parsing and handling C is made
more complex by the preprocessor. Rust is a much younger
language that avoided all that pain, and is easier to parse
and handle. C++ is way worse, though it might become closer
to Rust with C++ modules.
Rust is more willing to break existing code in projects, causing
previously compiling projects to no longer compile. rustc does this
rarely, but it has happened, also long after Rust 1.0.
From last year, 2024.
https://internals.rust-lang.org/t/type-inference-breakage-in-1-80-has-not-been-handled-well/21374
"Rust 1.80 broke builds of almost all versions of the
very popular time crate (edit: please don't shoot the
messenger in that GitHub thread!!!)
Rust has left only a 4-month old version working.
That was not enough time for the entire Rust
ecosystem to update, especially that older
versions of time were not yanked, and users
had no advance warning that it will stop working.
A crater run found a regression in over 5000 crates,
and that has just been accepted as okay without
any further action! This is below the level of stability
and reliability that Rust should have."
If C was willing to break code as much as Rust, it would be easier to
clean up C.
[Omitted] We do *not* want the
specification to just be a description of what the existing implementations at
the time happen to do, and declare all behavior differences to be UB or
unspecified or so just because no implementation is willing to adjust their
behavior to match the rest. [Omitted]
I have seen some Rust proponents literally say that there is
a specification for Rust, and that it is called rustc/LLVM.
Though those specific individuals may not have been the
most credible individuals.
A fear I have is that there may be hidden reliance in
multiple different ways on LLVM, as well as on rustc.
Maybe even very deeply so. The complexity of Rust's
type system and rustc's type system checking makes
me more worried about this point. If there are hidden
elements, they may turn out to be very difficult to fix,
especially if they are discovered to be fundamental.
While having one compiler can be an advantage in
some ways, it can arguably be a disadvantage
in some other ways, as you acknowledge as well
if I understand you correctly.
You mention ossifying, but the more popular Rust becomes,
the more painful breakage will be, and the less suited
Rust will be as a research language.
Does Crater run Rust for Linux and relevant Rust
kernel code?
I hope that any new language at least has its
language developers ensure that its type
system is formalized and proven correct
before that language's 1.0 release.
Since fixing a type system later can be difficult or
practically impossible. A complex type system
and complex type checking are a larger risk in this
regard than a simple type system and simple
type checking, especially as time passes and
more code is written in the language, since code
breakage then costs more, making the language
more difficult to fix.
There are some issues in Rust that I am curious as to
your views on. rustc or the Rust language has some type
system holes, which still cause problems for rustc and
its developers.
https://github.com/lcnr/solver-woes/issues/1
https://github.com/rust-lang/rust/issues/75992
Those kinds of issues seem difficult to solve.
In your opinion, is it accurate to say that the Rust language
developers are working on a new type system for
Rust-the-language and a new solver for rustc, and that
they are trying to make the new type system and new solver
as backwards compatible as possible?
It's not really a new type system. It's a new implementation for the same type
system. But yes there is work on a new "solver" (that I am not involved in) that
should finally fix some of the long-standing type system bugs. Specifically,
this is a "trait solver", i.e. it is the component responsible for dealing with
trait constraints. Due to some unfortunate corner-case behaviors of the old,
organically grown solver, it's very hard to do this in a backwards-compatible
way, but we have infrastructure for extensive ecosystem-wide testing to judge
the consequences of any given potential breaking change and ensure that almost
all existing code keeps working. In fact, Rust 1.84 already started using the
new solver for some things
(https://blog.rust-lang.org/2025/01/09/Rust-1.84.0.html) -- did you notice?
Hopefully not. :)
If it is not a new type system, why then do they talk about
backwards compatibility for existing Rust projects?
If the type system is not changed, existing projects would
still type check. And in this repository of one of the main
Rust language developers, as I understand it, several
issues are labeled with "S-fear".
https://github.com/lcnr/solver-woes/issues
They have also been working on this new solver for
several years. Reading through the issues, a lot of
the problems seem very hard.