Re: [PATCH] riscv: lib: Implement optimized memchr function

From: Ivan Orlov
Date: Tue Apr 02 2024 - 09:12:53 EST

Next message: Matti Vaittinen: "[RFC PATCH 3/6] mfd: support ROHM BD96801 PMIC core"
Previous message: Matti Vaittinen: "[RFC PATCH 2/6] dt-bindings: mfd: bd96801 PMIC core"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 27/03/2024 14:21, Palmer Dabbelt wrote:

On Mon, 11 Dec 2023 07:25:15 PST (-0800), ivan.orlov@xxxxxxxxxxxxxxx wrote:

On 11/12/2023 15:08, Andrew Jones wrote:

As you can see, the new function shows much better results even for
the small arrays of 256 elements, therefore I believe it could be a
useful addition to the existing riscv-specific string functions.

Looks good, but do we want to maintain both this version and a zbb
version? I'd expect a zbb version to be even better.

Hi Andrew,

Yes, ZBB analog would be much better, and if we use ZBB operations we
could avoid the most part of bit magic happening there.

+    add t1, x0, a2

move t1, a2

and for the remainder of the function s/x0/zero/

Alright, will be fixed in the next version.

+    sltiu t2, a2, MIN_BORDER
+    bnez t2, 6f
+
+    // get the number of bytes we should iterate before alignment

I'm not sure, but I think even in assembly we prefer the /* */ comment
format.

+    andi t0, a0, SZREG - 1
+    beqz t0, 4f
+
+    # get the SZREG - t0

I'm 99% sure we don't want to use the # comment syntax.

+    xor t0, t0, SZREG - 1

xori?

Hmm, I'm surprised that it is actually compilable... Yeah, should be fixed
+    addi t0, t0, 1
+
+    sub a2, a2, t0

nit: Looks a bit odd to put a blank line above the sub line above,
instead of above the below comment.

+    // iterate before alignment
+1:
+    beq t0, x0, 4f
+    lbu t2, 0(a0)
+    beq t2, a1, 3f
+    addi t0, t0, -1

This addi t0... isn't necessary if we do

Yeah, sounds reasonable, we can make it faster

    add t0, a0, t0
1:
    beq a0, t0, 4f
    ...
    ...
    addi a0, a0, 1
    j 1b

+    addi a0, a0, 1
+    j 1b
+
+2:
+    // found a word. Iterate it until we find the target byte
+    li t1, SZREG
+    j 6f

These instructions seem oddly placed among the rest.

+3:
+    ret

And this is an odd place to put this ret (after unconditional jump and
in the middle of the function). We can just put a label at the bottom ret.

I agree, thanks!

+
+4:
+    // get the count remainder
+    andi t1, a2, SZREG - 1
+
+    // align the count
+    sub a2, a2, t1
+
+    // if we have no words to iterate, iterate the remainder
+    beqz a2, 6f
+
+    // from 0xBA we will get 0xBABABABABABABABA
+    li t3, REP_01
+    mul t3, t3, a1

I don't think we want to implement an optimized assembly function with
mul. We can just use a few shifts and ors.

    slli    t3, a1, 8
    or    t3, t3, a1
    slli    t4, t3, 16
    or    t3, t4, t3
#if __riscv_xlen == 64
    slli    t4, t3, 32
    or    t3, t4, t3
#endif

Nice point, thanks! Will be optimized :)

+
+    add a2, a2, a0
+
+    li t4, REP_01
+    li t5, REP_80
+
+5:
+    REG_L t2, 0(a0)
+
+    // after this xor we will get one zero byte in the word if it contains the target byte
+    xor t2, t2, t3
+
+    // word v contains the target byte if (v - 0x01010101) & (~v) & 0x80808080 is positive

s/is positive/is not zero/

+    sub t0, t2, t4
+
+    not t2, t2
+
+    and t0, t0, t2
+    and t0, t0, t5
+
+    bnez t0, 2b
+    addi a0, a0, SZREG
+    bne a0, a2, 5b
+
+6:
+    // iterate the remainder
+    beq t1, x0, 7f
+    lbu t4, 0(a0)
+    beq t4, a1, 3b
+    addi a0, a0, 1
+    addi t1, t1, -1

Same comment as above about being able to drop the addi t1...

+    j 6b
+
+7:
+    addi a0, x0, 0

li a0, 0

+    ret
+SYM_FUNC_END(memchr)
+SYM_FUNC_ALIAS(__pi_memchr, memchr)
--
2.34.1

Thanks,
drew

Thanks a lot for the review!

Do you have a v2? Sorry if I lost it.

Hi Palmer,

Sorry for the late reply.

After a few experiments it became clear that we won't get such a large performance gain for the xlen=32. Also, I collected some usage statistics on the system, and it shown that `memchr` has to iterate more than 128 bytes quite infrequently.

Considering this information, it seems to me that such an overcomplication of the `memchr` function simply doesn't worth it. So, there was no V2 for this patch :(

Sorry, I should've written it earlier.

--
Kind regards,
Ivan Orlov

Next message: Matti Vaittinen: "[RFC PATCH 3/6] mfd: support ROHM BD96801 PMIC core"
Previous message: Matti Vaittinen: "[RFC PATCH 2/6] dt-bindings: mfd: bd96801 PMIC core"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]