Re: [PATCH next] string: Optimise strlen()

From: David Laight

Date: Sun Apr 19 2026 - 06:41:46 EST


On Fri, 27 Mar 2026 19:57:37 +0000
david.laight.linux@xxxxxxxxx wrote:

> From: David Laight <david.laight.linux@xxxxxxxxx>
>
> Unrolling the loop once significantly improves performance on some CPU.
> Userspace testing on a Zen-5 shows it runs at two bytes/clock rather than
> one byte/clock with only a marginal additional overhead.

I hate benchmarks.

I've finally got around to looking at this again (on x86-64).
I changed the order of the 'single byte' and 'two byte' tests and the
'two byte' loop slowed down massively - to pretty much the same speed
as the 'single byte' loop.
gcc had swapped over the two functions in the object file.
Swapping the order changed the alignment of the loop top between odd and
even multiples of 16 (this alignment is disabled in kernel to avoid bloat).
The loop in the 'two byte' code is 17 bytes; in the slow case the loop
top is aligned to an odd boundary, so the last byte lands in a different
32 byte code block - which is presumably slow.
Changing the two 'cmpb $0, mem' instructions to (say) 'cmpb %cl, mem'
would shrink the loop to 15 bytes, so it wouldn't cross a 16 byte boundary.
(The 'single byte' loop doesn't cross a 16 byte boundary in the test program.)
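For reference, the two loops under discussion correspond roughly to the
following C (my reconstruction for illustration, not the actual patch;
the cited numbers come from the quoted message):

```c
#include <stddef.h>

/* Plain byte-at-a-time strlen(): one byte tested per iteration. */
size_t strlen_byte(const char *s)
{
	const char *p = s;

	while (*p)
		p++;
	return p - s;
}

/*
 * The loop unrolled once: two bytes tested per iteration.
 * The quoted measurement was ~2 bytes/clock on Zen-5 vs
 * 1 byte/clock for the loop above - but, as discussed, the
 * generated loop body is 17 bytes, so its speed depends on
 * where it lands relative to 16/32 byte code boundaries.
 */
size_t strlen_2byte(const char *s)
{
	const char *p = s;

	for (;;) {
		if (!p[0])
			return p - s;
		if (!p[1])
			return p - s + 1;
		p += 2;
	}
}
```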

The kernel build I just looked at has strlen() aligned to a 16 byte
boundary with the branch crossing the next 16 byte boundary.
So, if the same is true as in my test program, strlen() will run a
lot slower on 50% of kernel builds.
(And other CPUs may have costs associated with the 16 byte boundary.)

Mostly this means that however hard you try you are guaranteed to
lose somewhere :-(

>
> Using 'byte masking' is faster for longer strings - the break-even point
> is around 56 bytes on the same Zen-5 (there is much larger overhead, then
> it runs at 16 bytes in 3 clocks).
> But the majority of kernel calls won't be near that length.
> There will also be extra overhead for big-endian systems and those
> without a fast ffs().

I've had a further thought on that as well.

The 'byte masking' code is somewhat larger (112 bytes rather than 32 or 48).
While the extra overhead is ~20 clocks, that is less than the 'branch
mispredict' penalty the byte loop suffers every time the length
changes.
So for randomly changing lengths I'm beginning to think the 'byte mask'
version is better.

I ran the code on a Haswell a while back; the break-even length was
also somewhat shorter (I'm remembering 32 bytes).

This all means the byte masking code may actually be sensible provided:
- LE, or BE with a byte-swapping memory read.
- a fast ffsl().
- 64bit.
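Under exactly those assumptions, the 'byte masking' approach looks
roughly like this (a minimal 64-bit little-endian sketch for
illustration, not the patch's actual code; __builtin_ctzll stands in
for the "fast ffsl()"):

```c
#include <stddef.h>
#include <stdint.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/*
 * Word-at-a-time strlen() sketch: 64-bit little-endian only.
 * Only aligned words are read, so the function never touches a
 * page the string doesn't (any bytes before the string start sit
 * in the same aligned word).  (x - ONES) & ~x & HIGHS sets the
 * high bit of the lowest zero byte of x without flagging any
 * earlier byte; __builtin_ctzll converts that to a byte index.
 * (Aliasing rules are bent here, as all such implementations do.)
 */
size_t strlen_mask(const char *s)
{
	const uint64_t *w = (const uint64_t *)((uintptr_t)s & ~(uintptr_t)7);
	unsigned int shift = ((uintptr_t)s & 7) * 8;
	uint64_t x = *w;
	uint64_t zeros;

	/* Force any bytes before the start of the string to be non-zero. */
	if (shift)
		x |= (1ULL << shift) - 1;

	for (;;) {
		zeros = (x - ONES) & ~x & HIGHS;
		if (zeros)
			return (size_t)((const char *)w +
					(__builtin_ctzll(zeros) >> 3) - s);
		x = *++w;
	}
}
```

Note the setup (alignment, shift, masking the leading bytes) is where
the ~20 clock overhead comes from; the loop itself has one
well-predicted branch per 8 bytes, which is why it wins once strings
get past the break-even length.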

David