From: Akira Tsukamoto
Sent: 04 June 2021 10:57
Reducing pipeline stall of read after write (RAW).
These are the results from combination of the speedup with
Gary's misalign fix. Speeds up from 680Mbps to 900Mbps.
Before applying these two patches.
I think the changes should be in separate patches.
Otherwise it is difficult to see what is relevant.
It also looks as if there is a register rename.
Maybe that should be a precursor patch?
...
I think this is the old main copy loop:
1:
- fixup REG_L, t2, (a1), 10f
- fixup REG_S, t2, (a0), 10f
- addi a1, a1, SZREG
- addi a0, a0, SZREG
- bltu a1, t1, 1b
and this is the new one:
3:
+ fixup REG_L a4, 0(a1), 10f
+ fixup REG_L a5, SZREG(a1), 10f
+ fixup REG_L a6, 2*SZREG(a1), 10f
+ fixup REG_L a7, 3*SZREG(a1), 10f
+ fixup REG_L t0, 4*SZREG(a1), 10f
+ fixup REG_L t1, 5*SZREG(a1), 10f
+ fixup REG_L t2, 6*SZREG(a1), 10f
+ fixup REG_L t3, 7*SZREG(a1), 10f
+ fixup REG_S a4, 0(t5), 10f
+ fixup REG_S a5, SZREG(t5), 10f
+ fixup REG_S a6, 2*SZREG(t5), 10f
+ fixup REG_S a7, 3*SZREG(t5), 10f
+ fixup REG_S t0, 4*SZREG(t5), 10f
+ fixup REG_S t1, 5*SZREG(t5), 10f
+ fixup REG_S t2, 6*SZREG(t5), 10f
+ fixup REG_S t3, 7*SZREG(t5), 10f
+ addi a1, a1, 8*SZREG
+ addi t5, t5, 8*SZREG
+ bltu a1, a3, 3b
I don't know the architecture, but unless there is a stunning
pipeline delay for memory reads, a simple interleaved copy
may be fast enough.
So something like:
	a = src[0];
	do {
		b = src[1];
		src += 2;
		dst[0] = a;
		dst += 2;
		a = src[0];
		dst[-1] = b;
	} while (src != src_end);
	dst[0] = a;
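As a rough C rendering of the loop above, something like the following might do. It is only a sketch: `copy_interleaved` is a made-up name, the word type is assumed to be `unsigned long`, and the handling of zero and even word counts is an addition that the pseudocode leaves out (the pseudocode as written expects an odd count, with `src_end` pointing at the last word).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: copy n words with the loads and stores
 * interleaved, so each store uses a value loaded earlier rather
 * than the one loaded immediately before it.
 */
static void copy_interleaved(unsigned long *dst, const unsigned long *src,
                             size_t n)
{
    const unsigned long *src_end;
    unsigned long a, b;

    if (n == 0)
        return;
    assert(dst != NULL && src != NULL);

    /* The loop below wants an odd count: peel one word off if even. */
    if ((n & 1) == 0) {
        *dst++ = *src++;
        n--;
    }

    src_end = src + n - 1;      /* last word, stored after the loop */
    a = src[0];
    while (src != src_end) {
        b = src[1];
        src += 2;
        dst[0] = a;
        dst += 2;
        a = src[0];             /* still in bounds: src <= src_end */
        dst[-1] = b;
    }
    dst[0] = a;
}
```

The `do { } while` has become a `while` here so that a single-word copy does not read past the end of the source.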
It is probably worth doing benchmarks of the copy loop
in userspace.
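A userspace comparison could start from something like the following. It is only a sketch under several assumptions: all the function names are hypothetical, the C loops merely stand in for the assembly (there is no fixup/fault handling), the unrolled version assumes the word count is a multiple of 8, and the compiler is free to restructure either loop, so the generated code needs checking before trusting any numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*copy_fn)(unsigned long *dst, const unsigned long *src,
                        size_t n);

/* One word per iteration, like the old loop: load, then store it. */
static void copy_simple(unsigned long *dst, const unsigned long *src,
                        size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Eight loads followed by eight stores, like the shape of the new loop. */
static void copy_unrolled8(unsigned long *dst, const unsigned long *src,
                           size_t n)
{
    assert(n % 8 == 0);         /* tail handling is left out of this sketch */
    for (size_t i = 0; i < n; i += 8) {
        unsigned long r0 = src[i + 0], r1 = src[i + 1];
        unsigned long r2 = src[i + 2], r3 = src[i + 3];
        unsigned long r4 = src[i + 4], r5 = src[i + 5];
        unsigned long r6 = src[i + 6], r7 = src[i + 7];

        dst[i + 0] = r0; dst[i + 1] = r1;
        dst[i + 2] = r2; dst[i + 3] = r3;
        dst[i + 4] = r4; dst[i + 5] = r5;
        dst[i + 6] = r6; dst[i + 7] = r7;
    }
}

/* CPU time, in seconds, spent in `reps` calls of fn. */
static double time_copy(copy_fn fn, unsigned long *dst,
                        const unsigned long *src, size_t n, int reps)
{
    clock_t start = clock();

    for (int r = 0; r < reps; r++)
        fn(dst, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_copy()` once per candidate on the same buffers, with a buffer size and repeat count large enough to dominate the resolution of `clock()`, gives a first comparison. An interleaved variant can be passed in the same way as long as it has the same signature.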
David