As you can see, the new function shows much better results even for
small arrays of 256 elements, so I believe it could be a useful
addition to the existing riscv-specific string functions.
Looks good, but do we want to maintain both this version and a zbb
version? I'd expect a zbb version to be even better.
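For reference, a rough and untested idea of what the word loop could
look like with Zbb, where orc.b sets each byte of the result to 0xff if
the corresponding source byte is non-zero and to 0x00 if it is zero
(register choices below are just for illustration; t3 is assumed to
hold the replicated target byte and t5 to be preloaded with all ones):

REG_L t2, 0(a0)
xor t2, t2, t3     /* bytes equal to the target byte become 0x00 */
orc.b t2, t2       /* 0x00 stays 0x00, every other byte becomes 0xff */
bne t2, t5, 2f     /* not all ones -> this word contains the target byte */

which would avoid the (v - 0x01...) & ~v & 0x80... computation entirely.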
+ add t1, x0, a2
mv t1, a2
and for the remainder of the function s/x0/zero/
+ sltiu t2, a2, MIN_BORDER
+ bnez t2, 6f
+
+ // get the number of bytes we should iterate before alignment
I'm not sure, but I think even in assembly we prefer the /* */ comment
format.
+ andi t0, a0, SZREG - 1
+ beqz t0, 4f
+
+ # get the SZREG - t0
I'm 99% sure we don't want to use the # comment syntax.
+ xor t0, t0, SZREG - 1
xori?
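i.e.

xori t0, t0, SZREG - 1

Together with the addi below, that gives SZREG - t0 for t0 in
1..SZREG-1, since SZREG is a power of two.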
+ addi t0, t0, 1
+
+ sub a2, a2, t0
nit: It looks a bit odd to put the blank line above the sub instruction,
instead of above the comment below it.
+ // iterate before alignment
+1:
+ beq t0, x0, 4f
+ lbu t2, 0(a0)
+ beq t2, a1, 3f
+ addi t0, t0, -1
This addi t0... isn't necessary if we do
add t0, a0, t0
1:
beq a0, t0, 4f
...
...
addi a0, a0, 1
j 1b
+ addi a0, a0, 1
+ j 1b
+
+2:
+ // found a word. Iterate it until we find the target byte
+ li t1, SZREG
+ j 6f
These instructions seem oddly placed among the rest.
+3:
+ ret
And this is an odd place to put this ret (right after an unconditional
jump and in the middle of the function). We can just put a label on the
ret at the bottom.
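Something like (label numbers just for illustration)

7:
li a0, 0
8:
ret

and then branch to 8f instead of 3f when the byte is found, since a0
already points at it.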
+
+4:
+ // get the count remainder
+ andi t1, a2, SZREG - 1
+
+ // align the count
+ sub a2, a2, t1
+
+ // if we have no words to iterate, iterate the remainder
+ beqz a2, 6f
+
+ // from 0xBA we will get 0xBABABABABABABABA
+ li t3, REP_01
+ mul t3, t3, a1
I don't think we want to implement an optimized assembly function with
mul. We can just use a few shifts and ors.
slli t3, a1, 8
or t3, t3, a1
slli t4, t3, 16
or t3, t4, t3
#if __riscv_xlen == 64
slli t4, t3, 32
or t3, t4, t3
#endif
+
+ add a2, a2, a0
+
+ li t4, REP_01
+ li t5, REP_80
+
+5:
+ REG_L t2, 0(a0)
+
+ // after this xor we will get one zero byte in the word if it contains the target byte
+ xor t2, t2, t3
+
+ // word v contains the target byte if (v - 0x01010101) & (~v) & 0x80808080 is positive
s/is positive/is not zero/
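To spell the trick out with a worked byte: for a zero byte, 0x00 - 0x01
= 0xff sets the 0x80 bit and ~0x00 = 0xff keeps it, so the masked
result becomes non-zero; a non-zero byte either doesn't set its 0x80
bit in the subtraction or has it cleared by ~v, and a borrow rippling
in from a lower byte can only happen when that lower byte was already
zero. So the expression is non-zero exactly when the word contains a
zero byte, i.e. the target byte before the xor above.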
+ sub t0, t2, t4
+
+ not t2, t2
+
+ and t0, t0, t2
+ and t0, t0, t5
+
+ bnez t0, 2b
+ addi a0, a0, SZREG
+ bne a0, a2, 5b
+
+6:
+ // iterate the remainder
+ beq t1, x0, 7f
+ lbu t4, 0(a0)
+ beq t4, a1, 3b
+ addi a0, a0, 1
+ addi t1, t1, -1
Same comment as above about being able to drop the addi t1...
+ j 6b
+
+7:
+ addi a0, x0, 0
li a0, 0
+ ret
+SYM_FUNC_END(memchr)
+SYM_FUNC_ALIAS(__pi_memchr, memchr)
--
2.34.1
Thanks,
drew