Re: [PATCH] riscv: lib: Optimize 'strlen' function

From: Ivan Orlov
Date: Sun Dec 17 2023 - 17:54:55 EST

Next message: alexandre . belloni: "[PATCH] rtc: ma35d1: remove hardcoded UIE support"
Previous message: Andrew Lunn: "Re: [PATCH 02/12 net-next] qca_spi: Improve SPI IRQ handling"
In reply to: David Laight: "RE: [PATCH] riscv: lib: Optimize 'strlen' function"
Next in thread: Ivan Orlov: "Re: [PATCH] riscv: lib: Optimize 'strlen' function"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 12/17/23 17:00, David Laight wrote:
> From: Ivan Orlov
> Sent: 13 December 2023 15:46
>>
>> The current non-ZBB implementation of 'strlen' function iterates the
>> memory bytewise, looking for a zero byte. It could be optimized to use
>> the wordwise iteration instead, so we will process 4/8 bytes of memory
>> at a time.
>
> ...
>
>> 1. If the address is unaligned, iterate SZREG - (address % SZREG) bytes
>> to align it.
>
> An alternative is to mask the address and 'or' in non-zero bytes
> into the first word - might be faster.

Hi David,

Yeah, it might be an option, I'll test it. Thanks!
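
For illustration only, the masking alternative suggested above could look
roughly like the following userspace C sketch. It is not code from the patch.
It assumes a little-endian machine with 64-bit words, and it assumes that
reading the whole aligned word containing the start of the string is allowed
(an aligned word never crosses a page boundary, but strict ISO C does not
permit the read). The names strlen_masked_sketch, has_zero and load_word are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  ((uint64_t)0x0101010101010101ULL)
#define HIGHS ((uint64_t)0x8080808080808080ULL)

/* Non-zero if and only if at least one byte of w is zero. */
static inline uint64_t has_zero(uint64_t w)
{
	return (w - ONES) & ~w & HIGHS;
}

/* Load one 64-bit word; memcpy sidesteps strict-aliasing problems. */
static inline uint64_t load_word(const char *p)
{
	uint64_t w;

	memcpy(&w, p, sizeof(w));
	return w;
}

/* Assumption: little-endian, so bytes before 's' are the low-order bytes. */
size_t strlen_masked_sketch(const char *s)
{
	size_t skip = (uintptr_t)s % sizeof(uint64_t);
	const char *p = s - skip;	/* round the address down to a word */
	uint64_t w = load_word(p);
	size_t i = 0;

	/* 'or' 0xff into the bytes that precede the string so that a stray
	 * zero byte there cannot be mistaken for the terminator. */
	if (skip)
		w |= ~(uint64_t)0 >> (64 - 8 * skip);

	/* Word-at-a-time scan; every load after the first is aligned. */
	while (!has_zero(w)) {
		p += sizeof(uint64_t);
		w = load_word(p);
	}

	/* Locate the first zero byte inside the final word. */
	while ((w >> (8 * i)) & 0xff)
		i++;

	return (size_t)((p + i) - s);
}
```

Compared with the byte-wise alignment prologue, this variant has no separate
loop before the word loop; whether that is actually faster on a given core
would still need to be measured.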

> ...
>
>> Here you can find the benchmarking results for the VisionFive2 board
>> comparing the old and new implementations of the strlen function.
>>
>> Size: 1 (+-0), mean_old: 673, mean_new: 666
>> Size: 2 (+-0), mean_old: 672, mean_new: 676
>> Size: 4 (+-0), mean_old: 685, mean_new: 659
>> Size: 8 (+-0), mean_old: 682, mean_new: 673
>> Size: 16 (+-0), mean_old: 718, mean_new: 694
>
> ...
>
> Is that 32bit or 64bit?
> The word-at-a-time strlen() is typically not worth it for 32bit.

I tested it on a 64-bit board only, as it is the only board I have...

I assume the performance gain would be less noticeable on 32-bit, and the word-oriented function could even be slower than the byte-oriented one for shorter strings.

However, I'm not sure if any physical 32-bit RISC-V boards with Linux support actually exist at the moment... So the only way to test the solution on a 32-bit system would be QEMU, and it probably wouldn't be really representative, right?

But it is definitely worth a try, and I could include a separate implementation for 32-bit RISC-V which simply iterates over the bytes if a 32-bit QEMU test shows significant overhead for the word-oriented function.

> I'd also guess that pretty much all the calls in-kernel are short.

I'm 99% sure they are! However, I believe that if the word-oriented solution doesn't introduce performance overhead for shorter strings but works much faster for longer strings, it is still worth implementing! :)

> You might try counting as: histogram[ilog2(strlen_result)]++
> and seeing what it shows for some workload.
> I bet you (a beer if I see you!) that you won't see many over 1k.

Sounds like a fun experiment, and I accept the bet! Beer is more than doable as I'm also located in the UK (Manchester).
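
For illustration only, the counting experiment could be sketched in userspace
C as below. It is not code from this thread: ilog2_sketch stands in for the
kernel's ilog2(), and strlen_hist and record_strlen are hypothetical names.
Since log2(0) is undefined, the sketch assumes empty strings get a bucket of
their own, which shifts every other bucket up by one.

```c
#include <stddef.h>

/* Bucket 0 counts empty strings; bucket k + 1 counts lengths in
 * [2^k, 2^(k+1)). 65 buckets cover every possible 64-bit length. */
static unsigned long strlen_hist[65];

/* floor(log2(n)) for n > 0; userspace stand-in for the kernel's ilog2(). */
static unsigned int ilog2_sketch(size_t n)
{
	unsigned int r = 0;

	while (n >>= 1)
		r++;
	return r;
}

/* Call this with every strlen() result observed during the workload. */
static void record_strlen(size_t len)
{
	unsigned int bucket = len ? ilog2_sketch(len) + 1 : 0;

	strlen_hist[bucket]++;
}
```

With this layout, lengths of 1024 and above land in bucket 11 or higher, so
the "not many over 1k" claim can be checked by summing those buckets.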

--
Kind regards,
Ivan Orlov
