From: maobibo
Got it. This approach makes better use of the pipeline, rather than relying on the number of ALUs in a particular micro-architecture. I will try this method. Thanks again for your kind help and patient explanation.
Sent: 14 February 2023 01:31...
Part of the asm code depends on the result of the previous instruction, see
https://github.com/loongson/linux/commit/92a6df48ccb73dd2c3dc1799add08adf0e0b0deb,
for example the ADDC macro:
#define ADDC(sum,reg) \
ADD sum, sum, reg; \
sltu t8, sum, reg; \
ADD sum, sum, t8; \
These three instructions depend on each other and cannot execute
in parallel.
Right, but you can add the carry bits into a different register.
Since the aim is 8 bytes/clock limited by 1 memory read/clock
you can (probably) manage with all the word adds going to one
register and all the carry adds to a second. So:
#define ADDC(carry, sum, reg) \
add sum, sum, reg; \
sltu reg, sum, reg; \
add carry, carry, reg
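For readers who prefer C, here is a minimal sketch of the same idea. It is not the kernel code; the function names csum_serial and csum_split are made up for illustration, and it assumes the data is already an array of aligned 64-bit words. The first function folds the carry back into the sum after every add (the dependent chain of the original ADDC); the second counts carries in a separate variable and folds them in once at the end.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: end-around carry folded in after every add.
 * Each step depends on the previous one, like the original ADDC. */
uint64_t csum_serial(const uint64_t *p, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += p[i];
        sum += (sum < p[i]);    /* sltu + add on the same register */
    }
    return sum;
}

/* Hypothetical variant: word adds go to 'sum', carries to 'carry'.
 * The carry add no longer feeds the next word add. */
uint64_t csum_split(const uint64_t *p, size_t n)
{
    uint64_t sum = 0, carry = 0;
    for (size_t i = 0; i < n; i++) {
        sum += p[i];
        carry += (sum < p[i]);  /* 1 if the add above wrapped */
    }
    /* Finalise: add the carry count back with end-around carry. */
    sum += carry;
    sum += (sum < carry);
    return sum;
}
```

Both functions should return the same value, because each wrap of the 64-bit sum is worth exactly 1 in ones' complement arithmetic, whether it is added back immediately or later.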
The original main loop at .Lmove_128bytes is:
#define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3) \
LOAD _t0, src, (offset + UNIT(0)); \
LOAD _t1, src, (offset + UNIT(1)); \
LOAD _t2, src, (offset + UNIT(2)); \
LOAD _t3, src, (offset + UNIT(3)); \
ADDC(_t0, _t1); \
ADDC(_t2, _t3); \
ADDC(sum, _t0); \
ADDC(sum, _t2)
.Lmove_128bytes:
CSUM_BIGCHUNK(src, 0x00, sum, t0, t1, t3, t4)
CSUM_BIGCHUNK(src, 0x20, sum, t0, t1, t3, t4)
CSUM_BIGCHUNK(src, 0x40, sum, t0, t1, t3, t4)
CSUM_BIGCHUNK(src, 0x60, sum, t0, t1, t3, t4)
addi.d t5, t5, -1
addi.d src, src, 0x80
bnez t5, .Lmove_128bytes
I modified the main loop at label .Lmove_128bytes to reduce the
dependencies between instructions, as shown here; it can improve the
performance.
.Lmove_128bytes:
LOAD t0, src, 0
LOAD t1, src, 8
LOAD t3, src, 16
LOAD t4, src, 24
LOAD a3, src, 0 + 0x20
LOAD a4, src, 8 + 0x20
LOAD a5, src, 16 + 0x20
LOAD a6, src, 24 + 0x20
ADD t0, t0, t1
ADD t3, t3, t4
ADD a3, a3, a4
ADD a5, a5, a6
sltu t8, t0, t1
sltu a7, t3, t4
ADD t0, t0, t8
ADD t3, t3, a7
sltu t1, a3, a4
sltu t4, a5, a6
ADD a3, a3, t1
ADD a5, a5, t4
ADD t0, t0, t3
ADD a3, a3, a5
sltu t1, t0, t3
sltu t4, a3, a5
ADD t0, t0, t1
ADD a3, a3, t4
ADD sum, sum, t0
sltu t8, sum, t0
ADD sum, sum, t8
ADD sum, sum, a3
sltu t8, sum, a3
addi.d t5, t5, -1
ADD sum, sum, t8
However, the result and the principle are almost the same as with the
uint128 C code, and interleaving the reads and ALU operations has no
performance impact.
You are still relying on the 'out of order' logic to execute
ALU instructions while the memory reads are going on.
Try something like:
complex setup :-)
loop:
sltu c0, sum, v0
load v0, src, 0
add sum, v1
add carry, c3
sltu c1, sum, v1
load v1, src, 8
add sum, v2
add carry, c0
sltu c2, sum, v2
load v2, src, 16
addi src, 32
add sum, v3
add carry, c1
sltu c3, sum, v3
load v3, src, 24
add sum, v0
add carry, c2
bne src, limit, loop
complex finalise
The idea being that each group of instructions executes
in one clock - so the loop is 4 clocks.
The above code allows for 2 delay clocks on reads.
They may not be needed; in that case the above may run
at 8 bytes/clock with just 2 blocks of instructions.
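A rough C rendering of the "load ahead" structure, with only one stage of delay instead of the rotating v0..v3 and c0..c3 registers, might look like the sketch below. The name csum_pipelined is hypothetical, the input is assumed to be an array of 64-bit words, and a compiler is free to reschedule this, so it only illustrates what the setup and finalise steps have to do, not how the asm would be timed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: the load for the next word is issued before the
 * add and carry test for the current word. */
uint64_t csum_pipelined(const uint64_t *p, size_t n)
{
    uint64_t sum = 0, carry = 0;

    if (n == 0)
        return 0;

    uint64_t v = p[0];              /* setup: first load ahead of the loop */
    for (size_t i = 1; i < n; i++) {
        uint64_t next = p[i];       /* load for the next iteration */
        sum += v;                   /* add the word loaded earlier */
        carry += (sum < v);         /* carries go to their own register */
        v = next;
    }

    /* Finalise: consume the last loaded word, then fold the carries. */
    sum += v;
    carry += (sum < v);
    sum += carry;
    sum += (sum < carry);
    return sum;
}
```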
You'd give the cpu a bit more leeway by using two sum and
carry registers.
I'd time the loop without worrying about the setup/finalise
code.
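As a sketch of the "two sum and carry registers" suggestion, again in C rather than asm and with a made-up name (csum_two_lanes), even and odd words can go to independent accumulators that are merged only once at the end. It assumes an array of 64-bit words short enough that the carry counters cannot overflow, which holds for any realistic buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: two independent sum/carry pairs, merged at the end. */
uint64_t csum_two_lanes(const uint64_t *p, size_t n)
{
    uint64_t sum0 = 0, carry0 = 0;
    uint64_t sum1 = 0, carry1 = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        sum0 += p[i];
        carry0 += (sum0 < p[i]);
        sum1 += p[i + 1];
        carry1 += (sum1 < p[i + 1]);
    }
    if (i < n) {                    /* odd trailing word */
        sum0 += p[i];
        carry0 += (sum0 < p[i]);
    }

    /* Merge the two lanes, then fold the carry count back in. */
    sum0 += sum1;
    carry0 += (sum0 < sum1);
    carry0 += carry1;
    sum0 += carry0;
    sum0 += (sum0 < carry0);
    return sum0;
}
```

The two lanes never read each other's registers inside the loop, so the only dependency chains are one add per lane per word.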
David