RE: [PATCH v2] LoongArch: add checksum optimization for 64-bit system

From: David Laight
Date: Tue Feb 14 2023 - 04:48:13 EST


From: maobibo
> Sent: 14 February 2023 01:31
...
> Part of the asm code depends on an earlier commit,
> https://github.com/loongson/linux/commit/92a6df48ccb73dd2c3dc1799add08adf0e0b0deb,
> such as the macro ADDC:
> #define ADDC(sum,reg) \
> ADD sum, sum, reg; \
> sltu t8, sum, reg; \
> ADD sum, sum, t8; \
> These three instructions depend on each other and cannot execute
> in parallel.

Right, but you can add the carry bits into a different register.
Since the aim is 8 bytes/clock limited by 1 memory read/clock
you can (probably) manage with all the word adds going to one
register and all the carry adds to a second. So:
#define ADDC(carry, sum, reg) \
add sum, sum, reg; \
sltu reg, sum, reg; \
add carry, carry, reg
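
As a rough C model of that split (a sketch only, not code from the
patch): csum_split_carry is a made-up name, and the buffer is assumed
to be a whole number of 64-bit words. The adds into 'sum' never wait
for a compare result; the carries are counted separately and folded
back in once at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 64-bit end-around-carry sum with the carries
 * kept in their own accumulator. */
uint64_t csum_split_carry(const uint64_t *p, size_t nwords)
{
	uint64_t sum = 0, carry = 0;
	size_t i;

	assert(p != NULL || nwords == 0);

	for (i = 0; i < nwords; i++) {
		uint64_t v = p[i];

		sum += v;		/* add  sum, sum, reg */
		carry += sum < v;	/* sltu reg, sum, reg; add carry, carry, reg */
	}

	/* Finalise: fold the carry count back in, end-around. */
	sum += carry;
	sum += sum < carry;
	return sum;
}
```

The final two adds are the only place where one add has to wait for
the compare that follows the previous add.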

>
> The original main loop at .Lmove_128bytes is:
> #define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3) \
> LOAD _t0, src, (offset + UNIT(0)); \
> LOAD _t1, src, (offset + UNIT(1)); \
> LOAD _t2, src, (offset + UNIT(2)); \
> LOAD _t3, src, (offset + UNIT(3)); \
> ADDC(_t0, _t1); \
> ADDC(_t2, _t3); \
> ADDC(sum, _t0); \
> ADDC(sum, _t2)
>
> .Lmove_128bytes:
> CSUM_BIGCHUNK(src, 0x00, sum, t0, t1, t3, t4)
> CSUM_BIGCHUNK(src, 0x20, sum, t0, t1, t3, t4)
> CSUM_BIGCHUNK(src, 0x40, sum, t0, t1, t3, t4)
> CSUM_BIGCHUNK(src, 0x60, sum, t0, t1, t3, t4)
> addi.d t5, t5, -1
> addi.d src, src, 0x80
> bnez t5, .Lmove_128bytes
>
> I modified the main loop at label .Lmove_128bytes like this to
> reduce the dependencies between instructions, which can improve
> the performance.
> .Lmove_128bytes:
> LOAD t0, src, 0
> LOAD t1, src, 8
> LOAD t3, src, 16
> LOAD t4, src, 24
> LOAD a3, src, 0 + 0x20
> LOAD a4, src, 8 + 0x20
> LOAD a5, src, 16 + 0x20
> LOAD a6, src, 24 + 0x20
> ADD t0, t0, t1
> ADD t3, t3, t4
> ADD a3, a3, a4
> ADD a5, a5, a6
> sltu t8, t0, t1
> sltu a7, t3, t4
> ADD t0, t0, t8
> ADD t3, t3, a7
> sltu t1, a3, a4
> sltu t4, a5, a6
> ADD a3, a3, t1
> ADD a5, a5, t4
> ADD t0, t0, t3
> ADD a3, a3, a5
> sltu t1, t0, t3
> sltu t4, a3, a5
> ADD t0, t0, t1
> ADD a3, a3, t4
> ADD sum, sum, t0
> sltu t8, sum, t0
> ADD sum, sum, t8
> ADD sum, sum, a3
> sltu t8, sum, a3
> addi.d t5, t5, -1
> ADD sum, sum, t8
>
> However, the result and the principle are almost the same as
> the uint128 C code, and interleaving the reads and ALU
> operations has no performance impact.

You are still relying on the 'out of order' logic to execute
ALU instructions while the memory reads are going on.
Try something like:
complex setup :-)
loop:
sltu c0, sum, v0
load v0, src, 0
add sum, v1
add carry, c3

sltu c1, sum, v1
load v1, src, 8
add sum, v2
add carry, c0

sltu c2, sum, v2
load v2, src, 16
addi src, 32
add sum, v3
add carry, c1

sltu c3, sum, v3
load v3, src, 24 - 32 /* src has already been advanced */
add sum, v0
add carry, c2
bne src, limit, loop

complex finalise

The idea being that each group of instructions executes
in one clock - so the loop is 4 clocks.
The above code allows for 2 delay clocks on reads.
They may not be needed; in that case the loop may run
at 8 bytes/clock with just 2 blocks of instructions.

You'd give the cpu a bit more leeway by using two sum and
carry registers.
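
A C sketch of the two-chain variant, under stated assumptions:
csum_two_chains is a hypothetical name, the word count is assumed to
be even, and a compiler is free to schedule this differently from
hand-written assembler. Even words feed one sum/carry pair and odd
words the other, so consecutive adds never touch the same register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: two independent sum/carry chains. */
uint64_t csum_two_chains(const uint64_t *p, size_t nwords)
{
	uint64_t sum0 = 0, sum1 = 0, carry0 = 0, carry1 = 0;
	size_t i;

	assert(nwords % 2 == 0);	/* sketch handles whole pairs only */

	for (i = 0; i < nwords; i += 2) {
		uint64_t v0 = p[i], v1 = p[i + 1];

		sum0 += v0;
		sum1 += v1;
		carry0 += sum0 < v0;
		carry1 += sum1 < v1;
	}

	/* Finalise: merge the chains, then fold the carries end-around. */
	carry0 += carry1;	/* at most nwords carries, cannot overflow */
	sum0 += sum1;
	carry0 += sum0 < sum1;
	sum0 += carry0;
	sum0 += sum0 < carry0;
	return sum0;
}
```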

I'd time the loop without worrying about the setup/finalise
code.
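
One way to time just the loop from user space, as a minimal sketch:
csum_loop stands in for whichever loop is being measured, and
time_csum_loop is a hypothetical harness built on the standard C
clock() call. It repeats the call on the same cache-hot buffer and
reports the average CPU time per call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for the loop under test: adds plus a separate carry count. */
uint64_t csum_loop(const uint64_t *p, size_t nwords)
{
	uint64_t sum = 0, carry = 0;
	size_t i;

	for (i = 0; i < nwords; i++) {
		sum += p[i];
		carry += sum < p[i];
	}
	sum += carry;
	return sum + (sum < carry);
}

/* Hypothetical harness: average CPU seconds per call over 'reps' calls. */
double time_csum_loop(const uint64_t *p, size_t nwords, unsigned reps)
{
	/* Calling through a volatile pointer stops the compiler from
	 * hoisting the loop-invariant call out of the timing loop. */
	uint64_t (*volatile fn)(const uint64_t *, size_t) = csum_loop;
	volatile uint64_t sink = 0;
	clock_t start, end;
	unsigned r;

	assert(reps > 0);
	start = clock();
	for (r = 0; r < reps; r++)
		sink ^= fn(p, nwords);
	end = clock();
	(void)sink;
	return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Dividing the buffer size in bytes by (seconds per call * cpu clock
frequency) gives bytes/clock, which can be compared against the
8 bytes/clock target.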

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)