Re: [PATCH v3] lib: optimize cpumask_next_and()

From: Yury Norov
Date: Sat Oct 28 2017 - 09:25:00 EST


On Thu, Oct 26, 2017 at 02:58:00PM +0200, Alexey Dobriyan wrote:
> > - Refactored _find_next_common_bit into _find_next_bit, as suggested
> > by Yury Norov. This has no adverse effect on performance, as the
> > compiler successfully inlines the code.
>
> 1)
> Gentoo ships GCC 5.4.0, which doesn't inline this code on an x86_64
> defconfig (which has OPTIMIZE_INLINING).
>
>
> ffffffff813556c0 <find_next_bit>:
> ffffffff813556c0: 55 push rbp
> ffffffff813556c1: 48 89 d1 mov rcx,rdx
> ffffffff813556c4: 45 31 c0 xor r8d,r8d
> ffffffff813556c7: 48 89 f2 mov rdx,rsi
> ffffffff813556ca: 31 f6 xor esi,esi
> ffffffff813556cc: 48 89 e5 mov rbp,rsp
> ffffffff813556cf: e8 7c ff ff ff call ffffffff81355650 <_find_next_bit>
> ffffffff813556d4: 5d pop rbp
> ffffffff813556d5: c3 ret

GCC 7 for ARM64 doesn't inline it either. I wrote a test to measure
the effect of inlining:
http://www.spinics.net/lists/kernel/msg2635338.html
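
For reference, the measurement loop in that test looks roughly like
this. This is only a sketch, not the actual test module: BITMAP_LEN,
the random fill and the module boilerplate are illustrative, and the
real test also covers find_next_zero_bit(), find_last_bit() and
find_first_bit().

#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/module.h>
#include <linux/printk.h>
#include <linux/random.h>
#include <linux/timex.h>

/* Size and random fill are assumptions for illustration only. */
#define BITMAP_LEN	(4096UL * 8 * 10)
static DECLARE_BITMAP(bitmap, BITMAP_LEN) __initdata;

static int __init find_bit_test(void)
{
	unsigned long i, cnt = 0;
	cycles_t cycles;

	get_random_bytes(bitmap, sizeof(bitmap));

	cycles = get_cycles();
	/* Walk every set bit, counting the iterations. */
	for (i = find_next_bit(bitmap, BITMAP_LEN, 0); i < BITMAP_LEN;
	     i = find_next_bit(bitmap, BITMAP_LEN, i + 1))
		cnt++;
	cycles = get_cycles() - cycles;

	pr_err("find_next_bit: %llu cycles, %lu iterations\n",
	       (unsigned long long)cycles, cnt);

	return 0;
}
module_init(find_bit_test);
MODULE_LICENSE("GPL");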

The performance impact of this patch without inlining:

Before:
[ 96.856195] Start testing find_bit() with random-filled bitmap
[ 96.868322] find_next_bit: 34529 cycles, 16304 iterations
[ 96.879525] find_next_zero_bit: 35771 cycles, 16465 iterations
[ 96.891409] find_last_bit: 17444 cycles, 16304 iterations
[ 96.914445] find_first_bit: 1219671 cycles, 16305 iterations
[ 96.925802] Start testing find_bit() with sparse bitmap
[ 96.936308] find_next_bit: 301 cycles, 66 iterations
[ 96.946981] find_next_zero_bit: 70897 cycles, 32703 iterations
[ 96.958694] find_last_bit: 286 cycles, 66 iterations
[ 96.968710] find_first_bit: 5260 cycles, 66 iterations

After:
[ 116.205594] Start testing find_bit() with random-filled bitmap
[ 116.217621] find_next_bit: 24979 cycles, 16449 iterations
[ 116.228719] find_next_zero_bit: 25666 cycles, 16320 iterations
[ 116.240620] find_last_bit: 19407 cycles, 16449 iterations
[ 116.268368] find_first_bit: 1690945 cycles, 16449 iterations
[ 116.279718] Start testing find_bit() with sparse bitmap
[ 116.290219] find_next_bit: 352 cycles, 66 iterations
[ 116.300692] find_next_zero_bit: 50916 cycles, 32703 iterations
[ 116.312400] find_last_bit: 295 cycles, 66 iterations
[ 116.322427] find_first_bit: 6742 cycles, 66 iterations

And with inlining:

Before:
[ 169.464229] Start testing find_bit() with random-filled bitmap
[ 169.476191] find_next_bit: 17520 cycles, 16336 iterations
[ 169.487210] find_next_zero_bit: 17622 cycles, 16433 iterations
[ 169.499111] find_last_bit: 19272 cycles, 16335 iterations
[ 169.519735] find_first_bit: 978657 cycles, 16337 iterations
[ 169.530912] Start testing find_bit() with sparse bitmap
[ 169.541414] find_next_bit: 252 cycles, 66 iterations
[ 169.551726] find_next_zero_bit: 34554 cycles, 32703 iterations
[ 169.563436] find_last_bit: 294 cycles, 66 iterations
[ 169.573439] find_first_bit: 3964 cycles, 66 iterations

After:
[ 191.191170] Start testing find_bit() with random-filled bitmap
[ 191.203133] find_next_bit: 17530 cycles, 16346 iterations
[ 191.214150] find_next_zero_bit: 17630 cycles, 16423 iterations
[ 191.226037] find_last_bit: 17489 cycles, 16347 iterations
[ 191.246672] find_first_bit: 979961 cycles, 16347 iterations
[ 191.257849] Start testing find_bit() with sparse bitmap
[ 191.268351] find_next_bit: 257 cycles, 66 iterations
[ 191.278660] find_next_zero_bit: 34547 cycles, 32703 iterations
[ 191.290370] find_last_bit: 292 cycles, 66 iterations
[ 191.300376] find_first_bit: 4269 cycles, 66 iterations

I didn't investigate why the non-inlined version of this patch is
faster than the vanilla code, but the inlined one is faster still and
matches the inlined version of the existing code. I think we should
finally go with it.
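
For context, the shape being discussed is roughly the following. This
is my sketch, not the exact patch: the two-bitmap entry point name
(find_next_and_bit) and the parameter names are my approximation. The
point is a single helper that optionally ANDs in a second bitmap, with
thin wrappers on top; forcing the helper inline (e.g. __always_inline)
is what the "inlined" numbers above correspond to.

#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/kernel.h>

/*
 * Common helper: 'addr2' is NULL for the single-bitmap case, 'invert'
 * flips each word so the same code can serve find_next_zero_bit().
 */
static __always_inline unsigned long _find_next_bit(const unsigned long *addr1,
		const unsigned long *addr2, unsigned long nbits,
		unsigned long start, unsigned long invert)
{
	unsigned long tmp;

	if (unlikely(start >= nbits))
		return nbits;

	tmp = addr1[start / BITS_PER_LONG];
	if (addr2)
		tmp &= addr2[start / BITS_PER_LONG];
	tmp ^= invert;

	/* Handle the first word: mask off bits below 'start'. */
	tmp &= BITMAP_FIRST_WORD_MASK(start);
	start = round_down(start, BITS_PER_LONG);

	while (!tmp) {
		start += BITS_PER_LONG;
		if (start >= nbits)
			return nbits;

		tmp = addr1[start / BITS_PER_LONG];
		if (addr2)
			tmp &= addr2[start / BITS_PER_LONG];
		tmp ^= invert;
	}

	return min(start + __ffs(tmp), nbits);
}

unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
			    unsigned long offset)
{
	return _find_next_bit(addr, NULL, size, offset, 0UL);
}

unsigned long find_next_and_bit(const unsigned long *addr1,
		const unsigned long *addr2, unsigned long size,
		unsigned long offset)
{
	return _find_next_bit(addr1, addr2, size, offset, 0UL);
}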

It would be great if someone could test it on x86.

> 2)
> Making the "and" operation the centerpiece of this code is kind of meh;
> find_next_or_bit() will be hard to implement.

Not so hard actually. :)
https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1521775.html
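
To illustrate the point (my sketch, not the code behind that link):
with the word-at-a-time structure above, an OR variant only changes
how the two input words are combined.

#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/kernel.h>

/* Illustrative only; names and details are not from the link above. */
static inline unsigned long find_next_or_bit(const unsigned long *addr1,
		const unsigned long *addr2, unsigned long nbits,
		unsigned long start)
{
	unsigned long tmp;

	if (unlikely(start >= nbits))
		return nbits;

	/* Combine the words with | instead of &, mask bits below 'start'. */
	tmp = (addr1[start / BITS_PER_LONG] | addr2[start / BITS_PER_LONG]) &
	      BITMAP_FIRST_WORD_MASK(start);
	start = round_down(start, BITS_PER_LONG);

	while (!tmp) {
		start += BITS_PER_LONG;
		if (start >= nbits)
			return nbits;
		tmp = addr1[start / BITS_PER_LONG] | addr2[start / BITS_PER_LONG];
	}

	return min(start + __ffs(tmp), nbits);
}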

Yury