Re: [PATCH 1/2] x86/asm/bitops: ffs: use __builtin_ffs to evaluate constant expressions

From: Nick Desaulniers
Date: Tue May 10 2022 - 18:30:02 EST


On Tue, May 10, 2022 at 7:26 AM Vincent Mailhol
<mailhol.vincent@xxxxxxxxxx> wrote:
>
> For x86_64, the current ffs() implementation does not produce
> optimized code when called with a constant expression. On the
> contrary, the __builtin_ffs() function of both GCC and clang is able
> to simplify the expression into a single instruction.
>
> * Example *
>
> Let's consider two dummy functions foo() and bar() as below:
>
> | #include <linux/bitops.h>
> | #define CONST 0x01000000
> |
> | unsigned int foo(void)
> | {
> | 	return ffs(CONST);
> | }
> |
> | unsigned int bar(void)
> | {
> | 	return __builtin_ffs(CONST);
> | }
>
> GCC would produce the assembly code below:
>
> | 0000000000000000 <foo>:
> | 0: b8 ff ff ff ff mov $0xffffffff,%eax
> | 5: 0f bc c7 bsf %edi,%eax
> | 8: 83 c0 01 add $0x1,%eax
> | b: c3 ret
> | c: 0f 1f 40 00 nopl 0x0(%rax)
> |
> | 0000000000000010 <bar>:
> | 10: b8 19 00 00 00 mov $0x19,%eax
> | 15: c3 ret
>
> And clang would produce:
>
> | 0000000000000000 <foo>:
> | 0: 55 push %rbp
> | 1: 48 89 e5 mov %rsp,%rbp
> | 4: b8 ff ff ff ff mov $0xffffffff,%eax
> | 9: 0f bc 05 00 00 00 00 bsf 0x0(%rip),%eax # 10 <foo+0x10>
> | 10: ff c0 inc %eax
> | 12: 5d pop %rbp
> | 13: c3 ret
> | 14: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
> | 1b: 00 00 00
> | 1e: 66 90 xchg %ax,%ax
> |
> | 0000000000000020 <bar>:
> | 20: 55 push %rbp
> | 21: 48 89 e5 mov %rsp,%rbp
> | 24: b8 19 00 00 00 mov $0x19,%eax
> | 29: 5d pop %rbp
> | 2a: c3 ret

Right, we need to allocate registers to move the inputs into the asm
block, and the results back out. Inline asm is analogous to a call
with a custom calling convention, where we don't look into the body of
the inline asm.

Would -fomit-frame-pointer make these snippets cleaner, or did
you not build with -O2? If not, consider using both flags, since we
generally prefer the ORC unwinder on x86, not the frame pointer
unwinder. If the compilers are forcing a frame pointer when using the
builtins once optimizations are enabled, that's a problem (that we've
seen in the past with the builtins for reading eflags with clang; now
fixed).

>
> For both examples, we clearly see the benefit of using __builtin_ffs()
> instead of the kernel's asm implementation for constant expressions.
>
> However, for non-constant expressions, the ffs() asm version of the
> kernel remains better for x86_64 because, contrary to GCC, it doesn't
> emit the CMOV assembly instruction, cf. [1] (notably, clang is
> able to optimize out the CMOV instruction).
>
> This patch uses __builtin_constant_p() to select between the
> kernel's ffs() and __builtin_ffs() depending on whether the
> argument is constant or not.
>
>
> As a side benefit, this patch also removes the -Wshadow warning below:
>
> | ./arch/x86/include/asm/bitops.h:283:28: warning: declaration of 'ffs' shadows a built-in function [-Wshadow]
> | 283 | static __always_inline int ffs(int x)

Nice! :)

>
> [1] commit ca3d30cc02f7 ("x86_64, asm: Optimise fls(), ffs() and fls64()")
> http://lkml.kernel.org/r/20111213145654.14362.39868.stgit@xxxxxxxxxxxxxxxxxxxxxx

+ David, author of ca3d30cc02f7. I was wondering if this applied to
more than just x86, but I see now that some architectures just include
include/asm-generic/bitops/builtin-ffs.h into their
arch/*/include/asm/bitops.h. A hand-written asm version only makes
sense when we want to beat the compiler for non-ICE (integer constant
expression) inputs.

Patch LGTM; just minor comments on commit message, naming, and formatting.

>
>
> Signed-off-by: Vincent Mailhol <mailhol.vincent@xxxxxxxxxx>
> ---
> arch/x86/include/asm/bitops.h | 29 +++++++++++++++++------------
> 1 file changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index a288ecd230ab..535a7a358c14 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -269,18 +269,7 @@ static __always_inline unsigned long __fls(unsigned long word)
> #undef ADDR
>
> #ifdef __KERNEL__
> -/**
> - * ffs - find first set bit in word
> - * @x: the word to search
> - *
> - * This is defined the same way as the libc and compiler builtin ffs
> - * routines, therefore differs in spirit from the other bitops.
> - *
> - * ffs(value) returns 0 if value is 0 or the position of the first
> - * set bit if value is nonzero. The first (least significant) bit
> - * is at position 1.
> - */
> -static __always_inline int ffs(int x)
> +static __always_inline int __ffs_asm(int x)

How about variable_ffs rather than __ffs_asm? Let's try to stick with
the convention used by test_bit?

> {
> int r;
>
> @@ -310,6 +299,22 @@ static __always_inline int ffs(int x)
> return r + 1;
> }
>
> +/**
> + * ffs - find first set bit in word
> + * @x: the word to search
> + *
> + * This is defined the same way as the libc and compiler builtin ffs
> + * routines, therefore differs in spirit from the other bitops.
> + *
> + * ffs(value) returns 0 if value is 0 or the position of the first
> + * set bit if value is nonzero. The first (least significant) bit
> + * is at position 1.
> + */
> +#define ffs(x) \
> + (__builtin_constant_p(x) ? \
> + __builtin_ffs(x) : \
> + __ffs_asm(x))
> +

I think this whole #define can fit on one line? If not, perhaps the
__builtin_constant_p() check can start on the initial line? As posted,
it looks like the then/else clauses are indented by 1 tab followed by
1 space. Consider just using tabs.
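
Putting the naming and formatting comments together, a rough userspace
sketch of how that could look is below. sketch_ffs() and the loop body
of variable_ffs() are placeholders: the real variable_ffs() would keep
the existing asm, and the real macro would be called ffs. It is renamed
here only to avoid clashing with libc's ffs() declaration.

```c
/* Placeholder for the asm implementation; a plain loop for illustration. */
static inline int variable_ffs(int x)
{
	unsigned int v = (unsigned int)x;
	int pos = 1;

	if (!v)
		return 0;
	/* Shift right until the least significant set bit reaches bit 0. */
	while (!(v & 1)) {
		v >>= 1;
		pos++;
	}
	return pos;
}

/* Constants go to the builtin, everything else to variable_ffs(). */
#define sketch_ffs(x)					\
	(__builtin_constant_p(x) ? __builtin_ffs(x) : variable_ffs(x))
```

If the expression does not fit in 80 columns with the real names,
breaking after the "?" and after the ":" with tab-only indentation
would read fine too.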

> /**
> * fls - find last set bit in word
> * @x: the word to search
> --
> 2.35.1
>


--
Thanks,
~Nick Desaulniers