Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation

From: Yury Norov
Date: Fri Oct 28 2022 - 14:40:06 EST


+ Alexey Klimov

On Fri, Oct 28, 2022 at 06:45:50PM +0100, Russell King (Oracle) wrote:
> On Fri, Oct 28, 2022 at 10:05:29AM -0700, Linus Torvalds wrote:
> > On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle)
> > <rmk+kernel@xxxxxxxxxxxxxxx> wrote:
> > >
> > > Document the ARMv5 bit offset calculation code.
> >
> > Hmm. Don't the generic bitop functions end up using this? We do have a
> > comment in the code that says
> >
> > * On ARMv5 and above, the gcc built-ins may rely on the clz instruction
> > * and produce optimal inlined code in all cases. On ARMv7 it is even
> > * better by also using the rbit instruction.
>
> It's true that the generic code also makes use of the rbit and clz
> instructions - but in terms of the speed of the functions these only
> get used once we've found a word that is interesting to locate the
> bit we want in.
>
> > but that 'may' makes me wonder...
> >
> > IOW, what is it in the hand-written code that doesn't get done by the
> > generic code these days?
>
> For the _find_first_bit, there isn't much difference in the number
> of instructions or really what is going on, only the organisation
> and flow of the code is more inline - but that shouldn't make much
> of a difference. Yet, there is a definite repeatable measurable
> difference between the two:
>
> random-filled:
> arm : find_first_bit: 17778911 ns, 16448 iterations
> generic: find_first_bit: 18596622 ns, 16401 iterations
>
> sparse:
> arm : find_first_bit: 7301363 ns, 656 iterations
> generic: find_first_bit: 7589120 ns, 655 iterations
>
> The bigger difference is in the find_next_bit operations, and this
> likely comes from the arm32 code not having the hassles of the "_and"
> and other conditionals that the generic code has:
>
> random-filled:
> arm : find_next_bit: 2242618 ns, 163949 iterations
> generic: find_next_bit: 2632859 ns, 163743 iterations
>
> sparse:
> arm : find_next_bit: 40078 ns, 656 iterations
> generic: find_next_bit: 69943 ns, 655 iterations
>
> find_next_zero_bit show a greater difference:
>
> random-filled:
> arm : find_next_zero_bit: 2049129 ns, 163732 iterations
> generic: find_next_zero_bit: 2844221 ns, 163938 iterations
>
> sparse:
> arm : find_next_zero_bit: 3939309 ns, 327025 iterations
> generic: find_next_zero_bit: 5529553 ns, 327026 iterations

Those numbers disagree with what Alexey has measured on Odroid board
for A15 but somewhat in line with what he had for A7:

https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/

Can you please share details about your methodology: what CPU did you use;
did you lock cpu frequencies; in addition to mean, can you also show
standard deviation, raw logs?

To Alexey: if you have time, can you please repeat those measurements
on top of v6 + this series? This would help a lot in understanding how
new code performs, both generic and arch.

If generic vs arch code comparison looks different for different CPU
versions, what should we prefer?

> Here's the disassemblies for comparison. Note that the arm versions
> share code paths between the functions which makes the code even more
> compact - so the loop in the find_first gets re-used for find_next
> after we check the current word.
>
> generic:

[...]

> 000000e8 <_find_next_bit>:
> e8: e92d4070 push {r4, r5, r6, lr}
> ec: e1530002 cmp r3, r2
> f0: e59d4010 ldr r4, [sp, #16]
> f4: e59d5014 ldr r5, [sp, #20]
> f8: 2a000024 bcs 190 <_find_next_bit+0xa8>
> fc: e1a0e2a3 lsr lr, r3, #5
> 100: e3510000 cmp r1, #0
> 104: e203601f and r6, r3, #31
> 108: e3c3301f bic r3, r3, #31
> 10c: e790c10e ldr ip, [r0, lr, lsl #2]
> 110: 1791e10e ldrne lr, [r1, lr, lsl #2]
> 114: 100cc00e andne ip, ip, lr
> 118: e3e0e000 mvn lr, #0
> 11c: e02cc004 eor ip, ip, r4
> 120: e3550000 cmp r5, #0
> 124: e1a0e61e lsl lr, lr, r6
> 128: 1a00001a bne 198 <_find_next_bit+0xb0>
> 12c: e01cc00e ands ip, ip, lr
> 130: 1a000011 bne 17c <_find_next_bit+0x94>
> 134: e2833020 add r3, r3, #32
> 138: e1530002 cmp r3, r2
> 13c: 3a000003 bcc 150 <_find_next_bit+0x68>
> 140: ea000012 b 190 <_find_next_bit+0xa8>
> 144: e2833020 add r3, r3, #32
> 148: e1520003 cmp r2, r3
> 14c: 9a00000f bls 190 <_find_next_bit+0xa8>
> 150: e1a0e2a3 lsr lr, r3, #5
> 154: e3510000 cmp r1, #0
> 158: e790c10e ldr ip, [r0, lr, lsl #2]
> 15c: 1791e10e ldrne lr, [r1, lr, lsl #2]
> 160: 100cc00e andne ip, ip, lr
> 164: e15c0004 cmp ip, r4
> 168: 0afffff5 beq 144 <_find_next_bit+0x5c>
> 16c: e02cc004 eor ip, ip, r4
> 170: e3550000 cmp r5, #0
> 174: 0a000000 beq 17c <_find_next_bit+0x94>
> 178: e6bfcf3c rev ip, ip
> 17c: e6ffcf3c rbit ip, ip
> 180: e16fcf1c clz ip, ip
> 184: e08c3003 add r3, ip, r3
> 188: e1520003 cmp r2, r3
> 18c: 21a02003 movcs r2, r3
> 190: e1a00002 mov r0, r2
> 194: e8bd8070 pop {r4, r5, r6, pc}
> 198: e6bfef3e rev lr, lr
> 19c: e01cc00e ands ip, ip, lr
> 1a0: 0affffe3 beq 134 <_find_next_bit+0x4c>
> 1a4: eafffff3 b 178 <_find_next_bit+0x90>

On top of master, my generic _find_next_bit looks different:

000000e4 <_find_next_bit>:
e4: e1510002 cmp r1, r2
e8: e1a0c000 mov ip, r0
ec: e1a00001 mov r0, r1
f0: 912fff1e bxls lr
f4: e1a012a2 lsr r1, r2, #5
f8: e92d4010 push {r4, lr}
fc: e202201f and r2, r2, #31
100: e3e03000 mvn r3, #0
104: e79ce101 ldr lr, [ip, r1, lsl #2]
108: e01ee213 ands lr, lr, r3, lsl r2
10c: 1a000012 bne 15c <_find_next_bit+0x78>
110: e2813001 add r3, r1, #1
114: e1a04283 lsl r4, r3, #5
118: e1540000 cmp r4, r0
11c: 28bd8010 popcs {r4, pc}
120: e08c2101 add r2, ip, r1, lsl #2
124: ea000002 b 134 <_find_next_bit+0x50>
128: e1a04283 lsl r4, r3, #5
12c: e1500004 cmp r0, r4
130: 98bd8010 popls {r4, pc}
134: e5b2e004 ldr lr, [r2, #4]!
138: e2833001 add r3, r3, #1
13c: e35e0000 cmp lr, #0
140: 0afffff8 beq 128 <_find_next_bit+0x44>
144: e6ffef3e rbit lr, lr
148: e16fef1e clz lr, lr
14c: e084400e add r4, r4, lr
150: e1500004 cmp r0, r4
154: 21a00004 movcs r0, r4
158: e8bd8010 pop {r4, pc}
15c: e1a04281 lsl r4, r1, #5
160: eafffff7 b 144 <_find_next_bit+0x60>

Are you sure you're running latest kernel?

Thanks,
Yury