RE: [PATCH] riscv: lib: Optimize 'strlen' function

From: David Laight
Date: Sun Dec 17 2023 - 13:11:56 EST


From: Ivan Orlov
> Sent: 13 December 2023 15:46

Looking at the old code...

> 1:
> - lbu t0, 0(t1)
> - beqz t0, 2f
> - addi t1, t1, 1
> - j 1b

I suspect there is (at least) a two clock stall between
the 'lbu' and 'beqz'.
Allowing for one clock for the 'predicted taken' branch
that is 7 clocks/byte (4 instructions + 2 stall + 1 branch).

Try this one - especially on 32bit:

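mv t0, a0
andi t1, t0, 1        # t1 = 1 if the start address is odd
sub t0, t0, t1        # round t0 down to an even address
bnez t1, 2f           # odd start: skip the first byte of the pair
1:
lbu t1, 0(t0)
2: lbu t2, 1(t0)
addi t0, t0, 2
beqz t1, 3f
bnez t2, 1b
addi t0, t0, 1        # NUL was the second byte of the pair
3: addi t0, t0, -2    # undo the last increment
sub a0, t0, a0
ret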

Might be 6 clocks for 2 bytes (5 instructions + 1 for the taken
branch, with each load two instructions ahead of its use).
The much smaller cache footprint will also help.
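
For checking the loop logic off-target, here is a rough C model of the
same idea. It is only a sketch, not a translation of the assembly: the
name strlen_pairs is made up for illustration, and the odd-start case
tests the first byte directly instead of stepping back to the even
address. Like the assembly, it can read one byte past the terminating
NUL (the other half of the aligned pair), so the caller is assumed to
guarantee that byte is readable.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical C model of a two-bytes-per-iteration strlen.
 * Assumption: the byte after the terminating NUL is readable when it
 * shares an aligned 2-byte pair with the NUL.
 */
static size_t strlen_pairs(const char *s)
{
    const unsigned char *start = (const unsigned char *)s;
    const unsigned char *p = start;

    /* Odd start address: handle one byte so that p becomes even. */
    if ((uintptr_t)p & 1) {
        if (p[0] == 0)
            return 0;
        p += 1;
    }

    /* Main loop: issue both loads before testing either byte. */
    for (;; p += 2) {
        unsigned char b0 = p[0];
        unsigned char b1 = p[1];    /* may be one byte past the NUL */

        if (b0 == 0)
            return (size_t)(p - start);
        if (b1 == 0)
            return (size_t)(p + 1 - start);
    }
}
```

Calling it on a buffer and on the same buffer offset by one byte
exercises both the even and the odd entry path, whatever the buffer's
actual alignment.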

David
