Re: C aggregate passing (Rust kernel policy)

From: David Laight
Date: Sat Feb 22 2025 - 17:13:29 EST


On Sat, 22 Feb 2025 16:22:08 -0500
Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:

> On Sat, Feb 22, 2025 at 12:54:31PM -0800, H. Peter Anvin wrote:
> > VLIW and OoO might seem orthogonal, but they aren't – because they are
> > trying to solve the same problem, combining them either means the OoO
> > engine can't do a very good job because of false dependencies (if you
> > are scheduling molecules) or you have to break them instructions down
> > into atoms, at which point it is just a (often quite inefficient) RISC
> > encoding. In short, VLIW *might* make sense when you are statically
> > scheduling a known pipeline, but it is basically a dead end for
> > evolution – so unless you can JIT your code for each new chip
> > generation...
>
> JITing for each chip generation would be a part of any serious new VLIW
> effort. It's plenty doable in the open source world and the gains are
> too big to ignore.

Doesn't most code get 'dumbed down' to whatever 'normal' ABI compilers
can easily handle.
A few hot loops might get optimised, but most code won't be.
Of course AI/GPU code is going to spend a lot of time in some tight loops.
But no one is going to go through the TCP stack and optimise the source
so that a compiler can make a better job of it for 'this years' cpu.

For various reasons ended up writing a simple 32bit cpu last year (in VHDL for an fgpa).
The ALU is easy - just a big MUX.
The difficulty is feeding the result of one instruction into the next.
Normal code needs to do that all the time, you can't afford a stall
(never mind the 3 clocks writing to/from the register 'memory' would take).
In fact the ALU dependencies [1] ended up being slower than the instruction fetch
code, so I managed to take predicted and unconditional branches without a stall.
So no point having the 'branch delay slot' of sparc32.
[1] multiply was the issue, even with a pipeline stall if the result has needed.
In any case it only had to run at 62.5MHz (related to the PCIe speed).

Was definitely an interesting exercise.

David